Robust sound localization using conditional time–frequency histograms



Parham Aarabi *, Sam Mavandadi

Department of Electrical and Computer Engineering, University of Toronto, 10 Kings College Rd., Toronto, Ont., Canada M5S 3G4

Received 17 September 2001; received in revised form 3 August 2002; accepted 19 September 2002

Abstract

A mechanism for robust time-delay of arrival (TDOA) estimation is proposed using time–frequency histograms (TFHs). By dividing the signals of two different microphones into time–frequency blocks, TDOA estimates are obtained from the block phase information and fused using a histogram. It is shown that TFHs are related to the traditional cross-correlation techniques, except that they are less lenient on large phase errors. It is also shown that a weighted or conditional TFH technique can achieve better results than a traditional histogram with unit incrementing functions. Conditional TFHs are explored by the introduction of three histogram incrementing functions, among which the magnitude product incrementing function (TFH-UCC) results in the smallest direction-of-arrival root-mean-square errors (RMSE) in the 0–10 dB signal-to-noise ratio (SNR) range when the noise consists of a second speaker. For Gaussian noise, or for high SNR (≥ 10 dB) speech noise, the generalized cross-correlation with the maximum-likelihood weighting function has the lowest RMSE.

© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Microphone array; Information fusion; Time–frequency analysis; Histogram fusion; Array signal processing

1. Introduction

In many applications, the localization of a sound source is required. For example, in order to point a camera toward a speaker in a lecture hall, or to form a beam pattern in order to filter the unwanted noises and focus on the speaker of interest, the location of the speaker must first be obtained.

Sound localization may be achieved by means of an array of microphones [2–7,9]. While different techniques such as MUSIC [14] and maximum likelihood (ML) [15] exist, most typical sound localization systems utilize (due to the complexity and the requirements of MUSIC and ML) the time-delay of arrival (TDOA) between different microphones. A single TDOA constrains the possible sound source locations to a hyperbola in 2-D, or a hyperboloid in 3-D. By obtaining TDOA estimates from multiple microphone pairs, multiple hyperbolae are obtained whose intersection in space corresponds to the sound source position. Since TDOA estimates may be noisy, there can often be several hyperbolae intersections. However, these intersections usually occur in the vicinity of the correct source position, and hence can be used as a means of localizing the source.

In order to obtain an estimate for the TDOA(s) of different sources at two different microphones, a variety of techniques can be used. Examples include the generalized cross-correlation technique [1,6], a variety of histogram techniques [3,9,10,13], and techniques based on neural networks [11,12]. Generally, cross-correlation based techniques are better suited to single-source situations where high estimation accuracy is required [2–6,9]. In contrast, histogram techniques, which often use cross-correlations on a variety of short signals in order to obtain multiple TDOA estimates, can deal with a larger number of sound or noise sources at a cost of lower accuracy [2,3,9]. Other techniques such as neural network based [7] or zero-crossing techniques also exist [10,13], but are often not used because of their relative inaccuracy and/or increased computational cost.

However, in practical environments, the robustness of histograms is often preferred to the accuracy of less robust techniques [2,17,18]. As a result, several interesting histogram algorithms have been proposed in the past few years, including iterative histogram generation [2], histogram-based classification [17], and restricted histograms [18]. The technique proposed by [17] trained histograms for different directions of arrival (DOAs) of the source and used them to classify the corresponding DOA of a sound segment. Two techniques, histogram matching and ML, were proposed. Histogram matching resulted in the most accurate sound localization system, although it required 2 s per localization.

In this paper we extend the idea of histogram-based TDOA estimators to time–frequency histograms (TFHs). The goal here is twofold. First, in situations where different sources are not temporally or spectrally overlapping, TFHs would result in histograms that distinguish between the different sources. Second, by computing histograms using time–frequency blocks instead of just time blocks, a much greater number of blocks becomes available, resulting in greater robustness.

* Corresponding author.

E-mail addresses: [email protected] (P. Aarabi), mavanda@ecf.utoronto.ca (S. Mavandadi).

1566-2535/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved.

doi:10.1016/S1566-2535(03)00003-4

Information Fusion 4 (2003) 111–122

www.elsevier.com/locate/inffus

2. Single segment TDOA estimation using cross-correlation

There are many different algorithms that attempt to estimate the most likely TDOA between a pair of observers [4–8]. Usually, these algorithms have a heuristic measure that estimates the likelihood of each TDOA and then selects the most likely one.

A simple model of the signal received by two microphones can be defined as [6]:

$$x_1(t) = h_1(t)\,s(t) + n_1(t) \tag{1}$$

$$x_2(t) = h_2(t)\,s(t + \tau) + n_2(t) \tag{2}$$

Basically, both microphones receive a time-delayed version of the source signal $s(t)$, through channels with possibly different impulse responses $h_1(t)$ and $h_2(t)$, and with microphone dependent noise signals $n_1(t)$ and $n_2(t)$. Generally, the noise sources can be assumed to be uncorrelated, and the impulse responses are assumed to be fixed for given microphone array and speaker positions. However, these assumptions will not always be true, and as a result, techniques have been developed that can to a certain extent tolerate the failure of these assumptions [6]. The main problem is to estimate $\tau$ given the microphone signals $x_1(t)$ and $x_2(t)$. The most common solution to this problem is the single segment generalized cross-correlation approach shown below [8]:

$$\tau^{*} = \arg\max_{\beta} \int_{-\infty}^{\infty} W(\omega)\,X_1(\omega)\,\overline{X_2(\omega)}\,e^{j\omega\beta}\,d\omega \tag{3}$$

where $X_1(\omega)$ and $X_2(\omega)$ are the Fourier transforms of the microphone signals, the overline denotes complex conjugation, and $\tau^{*}$ is an estimate of the original source signal delay between the two microphones. The actual choice of the weighting function $W(\omega)$ has been studied in detail for general sound and speech sources, and three different choices, the ML [6,7], the phase transform (PHAT) [6,8], and the unfiltered cross-correlation (UCC) [7,8], are shown below:

$$W_{\mathrm{ML}}(\omega) = \frac{|X_1(\omega)|\,|X_2(\omega)|}{|N_1(\omega)|^2\,|X_2(\omega)|^2 + |X_1(\omega)|^2\,|N_2(\omega)|^2} \tag{4}$$

$$W_{\mathrm{PHAT}}(\omega) = \frac{1}{|X_1(\omega)\,X_2(\omega)|} \tag{5}$$

$$W_{\mathrm{UCC}}(\omega) = 1 \tag{6}$$

The ML weights require knowledge about the spectrum of the microphone dependent noises. The phase transform does not require this knowledge, and hence has been employed more often due to its simplicity. The unfiltered cross-correlation does not utilize any weighting function.
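As a rough illustration of the single segment estimator of Eq. (3), the sketch below evaluates a discrete-frequency version of the cross-correlation over a small set of integer lags, with either the PHAT weighting of Eq. (5) or the UCC weighting of Eq. (6). The function names (`dft`, `gcc_tdoa`), the naive DFT (used only to avoid external dependencies), the integer lag grid, and the circularly shifted test signal are all assumptions made for this example and are not taken from the paper.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform, used to stay dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_tdoa(x1, x2, max_lag, weighting="phat"):
    """Integer lag (in samples) maximising a discrete form of Eq. (3)."""
    n = len(x1)
    # Cross-spectrum X1 * conj(X2); its phase carries the delay.
    cross = [a * b.conjugate() for a, b in zip(dft(x1), dft(x2))]
    if weighting == "phat":
        # Eq. (5): divide by the magnitude (numerically empty bins are dropped).
        cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    elif weighting != "ucc":
        # Eq. (6) leaves the cross-spectrum unchanged; nothing else is supported.
        raise ValueError("weighting must be 'phat' or 'ucc'")

    def score(beta):
        # Real part of sum_k W_k X1_k conj(X2_k) exp(j 2 pi k beta / n).
        return sum((c * cmath.exp(2j * math.pi * k * beta / n)).real
                   for k, c in enumerate(cross))

    return max(range(-max_lag, max_lag + 1), key=score)


# Synthetic check: x2(t) = x1(t + tau), built as a circular shift so that the
# DFT model holds exactly (an idealisation; real recordings are not circular).
random.seed(0)
n, tau = 64, 5
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [x1[(t + tau) % n] for t in range(n)]
estimates = {w: gcc_tdoa(x1, x2, max_lag=10, weighting=w) for w in ("phat", "ucc")}
```

In this noise-free setting both weightings recover the imposed delay. The ML weighting of Eq. (4) is left out because it needs the noise spectra, which this toy example does not model.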

3. Multi-segment TDOA estimation using the cross-correlation

In situations where a large speech segment has been recorded by two microphones, the standard version of the generalized cross-correlation can be applied [8]. It should be noted that the version of the generalized cross-correlation presented in Eqs. (3), (5), (6) was only a single segment approximation of the generalized cross-correlation technique shown below:

$$\tau^{*} = \arg\max_{\beta} \int_{-\infty}^{\infty} \hat{W}(\omega)\,\hat{G}_{X_1X_2}(\omega)\,e^{j\omega\beta}\,d\omega \tag{7}$$

where $\hat{W}(\omega)$ is a weighting function obtained from the estimated cross-power spectral density and the estimated power spectral density, and $\hat{G}_{X_1X_2}(\omega)$ is the estimated cross-power spectral density of the signals $X_1(\omega)$ and $X_2(\omega)$. Note that by estimated cross-power spectral density, it is meant that the signals are divided into small segments (typically 20 ms long), with each segment being processed by a Hanning window, and the average of the cross-power spectral densities of all the segments is then taken and used as the estimated cross-power spectral density. A similar procedure was performed for estimating the power spectral density. The ML and the PHAT weighting functions corresponding to this case are defined as [8]:

$$\hat{W}_{\mathrm{ML}}(\omega) = \frac{|\hat{G}_{X_1X_2}(\omega)|}{|\hat{G}_{X_1X_1}(\omega)\,\hat{G}_{X_2X_2}(\omega)| - |\hat{G}_{X_1X_2}(\omega)|^2} \tag{8}$$

$$\hat{W}_{\mathrm{PHAT}}(\omega) = \frac{1}{|\hat{G}_{X_1X_2}(\omega)|} \tag{9}$$

where $\hat{G}_{X_1X_1}(\omega)$ and $\hat{G}_{X_2X_2}(\omega)$ are the estimated power spectral densities of $X_1(\omega)$ and $X_2(\omega)$, respectively.

In this paper, when we compare the generalized cross-correlation techniques for long segments, we will utilize the method presented here. For single segment estimates, such as those required for histograms, the single segment methods of the previous section will be used.


4. TDOA fusion using histograms

Often, due to the presence of multiple sound sources, it is necessary to obtain both the number and the likelihood of different TDOAs by estimating their probability distribution function. In practical situations, the simplest method of estimating the probability distribution is to obtain multiple TDOA estimates that can be fused using a histogram. In the limit, as the number of TDOA estimates approaches infinity (while the sound sources remain stationary), the normalized TDOA histogram will approach the distribution of the TDOAs.

Assuming that a TDOA histogram bin is centered at TDOA $\tau$, has width $2\varepsilon$, and a total of $N$ TDOA estimates are available with TDOA result $\tau_i$ for the $i$th estimate, the histogram bin $h(\tau)$ can be defined as:

$$h(\tau) = \frac{1}{N} \sum_{i=1}^{N} \int_{\tau-\varepsilon}^{\tau+\varepsilon} \delta(t - \tau_i)\,dt \tag{10}$$

By the law of large numbers, assuming $N$ is large, the summation over $i$ in the limit becomes equivalent to the expectation, which, once simplified, reduces to:

$$h(\tau) \approx \int_{\tau-\varepsilon}^{\tau+\varepsilon} f_{\mathrm{TDOA}}(t)\,dt \tag{11}$$

where $f_{\mathrm{TDOA}}(t)$ is the probability density function of the TDOAs. In the limit, as $\varepsilon$ approaches 0, and as long as the histograms are normalized (constant scaling of the entire histogram), the histogram will be proportional to the probability density function of the TDOAs.
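The histogram bin of Eq. (10) amounts to counting the estimates that fall within $\varepsilon$ of each bin center and dividing by $N$. The following minimal sketch shows this; the function name `tdoa_histogram`, the half-open bin interval, and the sample estimates are illustrative assumptions rather than details given in the paper.

```python
def tdoa_histogram(estimates, centers, eps):
    """Eq. (10): fraction of TDOA estimates within eps of each bin centre.

    The interval is taken as half-open, [c - eps, c + eps), so that an
    estimate on a shared bin edge is counted once (an implementation choice).
    """
    n = len(estimates)
    return [sum(1 for t in estimates if c - eps <= t < c + eps) / n
            for c in centers]


# Hypothetical TDOA estimates in samples: most cluster near -5.
estimates = [-5.0, -5.2, -4.9, 5.1, 0.3, -5.0]
centers = list(range(-10, 11))          # unit-width bins, eps = 0.5
hist = tdoa_histogram(estimates, centers, eps=0.5)
peak = centers[hist.index(max(hist))]
```

With adjacent, non-overlapping bins the values sum to one, so the histogram can be read directly as an estimate of the probability mass in each bin.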

4.1. Frequency histograms

One method of obtaining multiple TDOA estimates and then combining them in a histogram is by using a frequency-band-selected TDOA histogram [10,13]. This approach divides up the spectrum of the signal into multiple frequency bands with each band selecting a single TDOA for histogram fusion. In order to obtain a histogram that is representative of the distribution of the TDOAs, many frequency bands are required (Fig. 1).

Because the phase of each of the frequency blocks wraps at $2\pi$, depending on the time-delay and the frequency of the block, it is sometimes impossible to obtain a single TDOA estimate. This is more problematic at higher frequencies, where the phase wrapping phenomenon can affect even small TDOAs. In these situations, we can extract all possible TDOAs so that instead of having a single selection for each frequency block, we have multiple TDOAs, only one of which is correct. This involves utilizing all of the peaks of the correlation function where for each peak the corresponding TDOA histogram bin is incremented. This can be viewed as a histogram aliasing phenomenon.
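The aliasing effect can be made concrete by listing every delay that is consistent with a wrapped phase difference. The sketch below is one possible way to do this, assuming the signal model of Eqs. (1) and (2), under which a delay $\tau$ produces a phase difference of $-\omega\tau$ between the two channels. The function name `candidate_tdoas`, the example phase value, and the use of $d/v$ (15 cm spacing, 345 m/s) as the largest physical delay are assumptions made for illustration.

```python
import math


def candidate_tdoas(phase_diff, freq_hz, max_tdoa):
    """All delays consistent with a wrapped phase difference.

    phase_diff : angle(X1) - angle(X2) in radians, known only modulo 2*pi
    freq_hz    : centre frequency of the block
    max_tdoa   : largest physically possible |TDOA| in seconds (d / v)
    """
    omega = 2.0 * math.pi * freq_hz
    # tau_k = -(phase_diff + 2*pi*k) / omega, kept only if |tau_k| <= max_tdoa.
    k_min = math.ceil((-omega * max_tdoa - phase_diff) / (2.0 * math.pi))
    k_max = math.floor((omega * max_tdoa - phase_diff) / (2.0 * math.pi))
    return sorted(-(phase_diff + 2.0 * math.pi * k) / omega
                  for k in range(k_min, k_max + 1))


max_tdoa = 0.15 / 345.0                          # d / v for a 15 cm array
low = candidate_tdoas(0.5, 500.0, max_tdoa)      # low frequency block
high = candidate_tdoas(0.5, 5000.0, max_tdoa)    # high frequency block
```

For these values the 500 Hz block yields a single candidate while the 5 kHz block yields several, each of which would increment its own histogram bin.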

Frequency histograms are effective when different sound sources have mutually exclusive frequency components [10,13]. However, when several sources have the same spectral contents, frequency histograms are often incapable of estimating the TDOAs of all sound sources. For this reason, frequency histograms are generally not used in practice.

However, frequency related techniques such as the one proposed in [18], which basically proposed a multi-frequency onset histogram with certain restrictions, have been shown to be able to localize sound source directions to within a few degrees. Nevertheless, by only utilizing single frequency bands, they have less potential for resolving multiple speakers than TFHs. It should be noted that while the technique of [18] did compute frequency histograms, it combined multiple frequency histograms and sequential time segments by using a Gaussian smoothing function. Hence, in effect, [18] did combine time–frequency information, although the histograms were computed across frequencies only.

4.2. Temporal histograms

A more commonly used histogram technique is temporal histograms, also known as iterative spatial probability [3] or temporal power fusion [9] based TDOA estimation. The idea with temporal histograms is that, in the case of multiple speech sources, not all sources will be simultaneously active. As a result, the periods during which a source is silent are used to collect information regarding the other sound sources, and vice versa. This is accomplished by temporally breaking up the speech signals into small segments (typically 20 ms long) whose TDOA estimates (obtained by a weighted cross-correlation) are incorporated into a histogram, as shown in Fig. 2. By obtaining many speech segments, numerous TDOA estimates become available, resulting in a histogram that is representative of the TDOA distribution.

Fig. 1. TDOA estimates corresponding to different frequency blocks are incorporated into a TDOA histogram in order to approximate the TDOA distribution.

In order to analyze the effectiveness of the temporal histogram TDOA estimation technique, a simulation with two male speakers was performed. One of the sources was considered to be the signal of interest (with a TDOA of −5 samples using a 20 kHz sampling rate) and the other was considered as noise (TDOA of 5 samples). The signal-to-noise ratio (SNR) was varied from 30 to −30 dB, with the normalized temporal histogram corresponding to 800 ms speech segments plotted in Fig. 3. As shown, above 0 dB the histogram has a dominant peak at the signal TDOA (−5) and below 0 dB the histogram shows a dominant peak at the noise TDOA (5).

Time-histograms are often used successfully in practice. One interesting implementation was that of [17]. In [17] a histogram training phase was proposed, where, for a finite number of directions of arrival, a TDOA histogram was produced (using multiple time segments from a long sound signal segment). The resulting histograms were then compared using a chi-squared metric to all the histograms in the database for the different DOAs. The DOA corresponding to the most similar histogram would then be identified as the DOA of the segment. This DOA estimation strategy results in a slow localization rate (0.5 Hz). An alternative strategy was also proposed by [17] which, by training histograms, would compute the ML direction of arrival from a single speech frame. This resulted in a substantial increase in the localization rate, although the increase was at a cost of a less accurate sound localization system.

It should be noted that the ideas proposed in [17], which are quite effective at dealing with reverberations, are not a new histogram technique but an extension of time-histogram techniques. Hence, the idea of TFHs that will be introduced in the next section can be extended in a similar fashion to that of [17].

5. Time–frequency histograms

Having seen both temporal and frequency histograms, it becomes clear that the two techniques can be combined to produce TFHs. This involves breaking up the signals both temporally and spectrally, and then incorporating the resulting TDOA of each block into a histogram, as shown in Fig. 4.

As mentioned previously, temporal histograms work well when multiple sound sources are not temporally overlapping. Although speech sources usually occupy the same general frequencies and can often be temporally overlapping, in many cases, due to the differences in the vowels spoken by different speakers, multiple speech sources tend to occupy different frequency blocks at different times. Hence, TFHs should surpass the performance of temporal histograms.

Fig. 2. TDOA estimates corresponding to different temporal blocks are incorporated in a temporal histogram.

Fig. 3. Temporal histograms as a function of SNR. Note that white corresponds to higher probability.

The speech signals used in the simulation of Fig. 3 were used once again in order to simulate TFHs at different SNRs. As shown in Fig. 5, the TFHs are much cleaner than their temporal counterparts. This is mainly because many more TDOA estimates are available in the TFH case than in the temporal case, and also because the TFH technique, unlike the temporal one, allows for the spectral separation of different speech sources. Note that the symmetrical nature of the histograms with respect to the SNRs is maintained in the TFH case as in the temporal case.

6. Conditional TF histograms

One issue regarding frequency histograms and TFHs is the amount of contribution that is made by each frequency or each time–frequency block. With time-histograms, it is reasonable to assume that the TDOA estimate of each block should be equally significant; therefore the contributions to the histogram would be the same. In other words, every time a TDOA estimate is obtained from a signal segment, the corresponding TDOA bin of the histogram is incremented by a constant value of 1. However, should a frequency or time–frequency block at 1 kHz be as important as one at 5 or 10 kHz? Due to the spectral characteristics of speech, a frequency dependent histogram incrementing technique might be more appropriate than equal contributions from all frequencies.

6.1. TFH with binary priors

The simplest block dependent histogram incrementing technique is a binary prior likelihood model [1,16]. Here, either a block is used in the histogram or it is not used. For example, it would make sense for the sound signal whose spectrogram is shown in Fig. 6 to only use the blocks in which the main signal is dominant (darker color), or in other words, to use a block that lies on the formants of the speech signal. This technique would be similar to the formant phase transform that has been utilized with success in recent years [1,16]. Of course, this requires prior knowledge regarding the location of the formants.

Fig. 4. TDOA estimates from time–frequency blocks are incorporated into a time–frequency histogram.

Fig. 5. Time–frequency histograms as a function of SNR.

Fig. 6. Blocks used for a conditional TFH corresponding to a voiced speech segment. The dark circles correspond to blocks used.

The same 800 ms sound signal that was used to generate the SNR dependent histograms of Figs. 3 and 5 was also used to generate conditional TFHs with binary priors. Here, the formant structure (only F1 and F2) of the speech source of interest (with a −5 sample TDOA) is used to generate the binary block likelihoods. The resulting SNR dependent histograms, shown in Fig. 7, illustrate the advantage of conditional TFHs, since the entire histogram now shows a consistent peak at −5 samples, even at SNRs as low as −20 dB. Note that these results are obtained from [16], and that a more detailed discussion of binary priors pertaining to speech sources appears in [16].

6.2. TFH with multi-valued priors

Although binary priors for conditional TFHs are in some cases easy to obtain, it is usually more appropriate to use weights that are not just 0 or 1, but can have a range of values. Here, we modify the histogram incrementing function of Eq. (10) as follows:

$$h(\tau) = \frac{1}{N} \sum_{i=1}^{N} c(i) \int_{\tau-\varepsilon}^{\tau+\varepsilon} \delta(t - \tau_i)\,dt \tag{12}$$

where $c(i)$ is a scaling factor for the $i$th time–frequency block.

The goal here is to weigh the TDOA estimates of the different frequency blocks in an 'optimal' manner. We will now attempt to find information about the TDOA incrementing weights from the generalized cross-correlation of Eq. (3). By noticing that the cross-correlation is always real, the discrete-frequency version of Eq. (3) can be stated as:

$$\tau^{*} = \arg\max_{\beta} \sum_{\omega} W(\omega)\,|X_1(\omega)|\,|X_2(\omega)| \cos\bigl(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta\bigr) \tag{13}$$

where $\tau^{*}$ is the cross-correlation estimate of the TDOA. This cross-correlation can be viewed as the maximization of a likelihood function for the TDOA, as follows:

$$\tau^{*} = \arg\max_{\beta} \ell(\beta) \tag{14}$$

where,

$$\ell(\beta) = \sum_{\omega} W(\omega)\,|X_1(\omega)|\,|X_2(\omega)| \cos\bigl(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta\bigr) \tag{15}$$

Now, we notice that the likelihood function is composed of a weighting function $\Psi(\omega)$ and a selector function $\xi(\omega)$ as shown below:

$$\ell(\beta) = \sum_{\omega} \Psi(\omega)\,\xi(\omega) \tag{16}$$

where,

$$\Psi(\omega) = W(\omega)\,|X_1(\omega)|\,|X_2(\omega)| \tag{17}$$

$$\xi(\omega) = \cos\bigl(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta\bigr) \tag{18}$$

Hence, the cross-correlation likelihood function is composed of a cosine phase error selector function. The phase error is a term that was initially proposed by [19], and is defined as $\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta$. Basically, when $\beta$ is close to the correct time-delay, the phase error will be small, and when $\beta$ is far from the correct time-delay, the phase error will be large. Consequently, the selector function will act as a phase error punish–reward technique. The generalized cross-correlation, with its cosine selector function, is lenient on larger phase errors. As shown in Fig. 8, a sharper selector function can lead to a better TDOA estimate because of its greater attenuation of larger phase errors.
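To make the difference between the two kinds of selector concrete, the sketch below evaluates the likelihood of Eq. (16) with the cosine selector of Eq. (18) and with a rectangular selector that returns 1 when the magnitude of the (wrapped) phase error is at most a threshold and 0 otherwise. The block format, the function names, the threshold chosen proportional to frequency, and the synthetic blocks are assumptions made for this example only.

```python
import math


def cosine_selector(phase_error):
    """Selector implied by the cross-correlation, Eq. (18)."""
    return math.cos(phase_error)


def rectangular_selector(phase_error, eps):
    """Sharper alternative: 1 inside +/- eps, 0 outside."""
    wrapped = math.remainder(phase_error, 2.0 * math.pi)   # map to [-pi, pi]
    return 1.0 if abs(wrapped) <= eps else 0.0


def likelihood(beta, blocks, selector):
    """Sum over blocks of weight * selector value, as in Eq. (16).

    blocks   : iterable of (omega, phase_diff, weight), where phase_diff is
               angle(X1) - angle(X2) for that block
    selector : function of (omega, phase_error)
    """
    return sum(weight * selector(omega, phase_diff + omega * beta)
               for omega, phase_diff, weight in blocks)


def cos_sel(omega, err):
    return cosine_selector(err)


def rect_sel(omega, err, a=0.5):
    # Threshold proportional to frequency (eps = a * omega), an assumption.
    return rectangular_selector(err, a * omega)


# Hypothetical noise-free blocks for a TDOA of 5 samples (omega in rad/sample).
blocks = [(omega, -5.0 * omega, 1.0) for omega in (0.1, 0.2, 0.3, 0.4)]
lags = range(-10, 11)
best_cos = max(lags, key=lambda b: likelihood(b, blocks, cos_sel))
best_rect = max(lags, key=lambda b: likelihood(b, blocks, rect_sel))
```

Both selectors peak at the true lag here, but a block with a phase error of 1 rad still contributes about half of its weight under the cosine selector, whereas the rectangular selector with a 0.5 rad threshold discards it entirely.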

Fig. 7. SNR dependent conditional TFHs using prior knowledge regarding the formant structure (F1 and F2) of the sound signal of interest.

The rectangular selector function of Fig. 8 is defined as:

$$\xi(\omega) = \begin{cases} 1, & |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| \le \varepsilon \\ 0, & |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| > \varepsilon \end{cases} \tag{19}$$

where $\varepsilon$ is a possibly frequency dependent width of the rectangular selector function. This rectangular selector function in fact makes the generalized cross-correlation likelihood equivalent to a TFH (in the case that $\varepsilon$ is equal to a constant $a$ times the frequency $\omega$). By using the sharper selector function, Eq. (15) becomes:

$$\ell(\beta) = \sum_{\forall \omega,\ |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| \le \varepsilon} W(\omega)\,|X_1(\omega)|\,|X_2(\omega)| \tag{20}$$

which is simply a weighted count of all time–frequency blocks whose phase difference corresponds to a TDOA equal to $\beta$. Here, each block is counted in the histogram bin centered at $\beta$ as long as:

$$-\beta - a < \frac{\angle X_1(\omega) - \angle X_2(\omega)}{\omega} < -\beta + a \tag{21}$$

where $a$ is half the width of each of the TDOA histogram bins. Eq. (21) allows us to build a histogram completely from the phase information (and amplitude information in case of the UCC or the ML incrementing functions defined below) without the need for any cross-correlation.

The above analysis illustrates that the TDOA likelihood function derived from the cross-correlation is in fact equivalent to a TFH with a weighted counter. This counter, or histogram incrementing function for the $i$th time–frequency block, is defined as:

$$c(i) = W(\omega)\,|X_{i,1}|\,|X_{i,2}| \tag{22}$$

This allows us to use the weighting functions derived for the cross-correlation. Hence, we propose the following three histogram incrementing functions:

$$c_{\mathrm{ML}}(i) = \frac{|X_{i,1}|^2\,|X_{i,2}|^2}{|N_{i,1}|^2\,|X_{i,2}|^2 + |X_{i,1}|^2\,|N_{i,2}|^2} \tag{23}$$

$$c_{\mathrm{UCC}}(i) = |X_{i,1}|\,|X_{i,2}| \tag{24}$$

$$c_{\mathrm{PHAT}}(i) = 1 \tag{25}$$

The phase transform incrementing function is equivalent to traditional histograms with unit incrementing functions.
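A conditional TFH along the lines of Eqs. (21), (24) and (25) can be sketched as follows. Each block contributes its phase-derived TDOA to the bin that contains it, with an increment of 1 (PHAT) or of the magnitude product (UCC). The block representation, the function names, and the synthetic two-source scenario are hypothetical and serve only to illustrate the effect of the weighting; the ML increment of Eq. (23) is omitted because it requires the noise spectra, and the 1/N factor of Eq. (12) is dropped because it does not move the histogram peak.

```python
import cmath


def increment(x1, x2, kind):
    """Histogram increment c(i) for one block: Eq. (25) or Eq. (24)."""
    if kind == "phat":
        return 1.0
    if kind == "ucc":
        return abs(x1) * abs(x2)
    raise ValueError("kind must be 'phat' or 'ucc'")


def conditional_tfh(blocks, centers, half_width, kind):
    """Weighted TDOA histogram built from block phases only, as in Eq. (21).

    blocks: iterable of (omega, X1, X2) with omega in rad/sample and X1, X2
    the complex spectral values of the block at the two microphones.
    """
    hist = [0.0] * len(centers)
    for omega, x1, x2 in blocks:
        phase_diff = cmath.phase(x1 * x2.conjugate())   # angle(X1) - angle(X2)
        tau = -phase_diff / omega                        # block TDOA estimate
        for j, beta in enumerate(centers):
            if beta - half_width <= tau < beta + half_width:
                hist[j] += increment(x1, x2, kind)
                break
    return hist


def block(omega, tau, magnitude):
    """Hypothetical noise-free block for a source with delay tau (samples)."""
    x1 = complex(magnitude, 0.0)
    return omega, x1, x1 * cmath.exp(1j * omega * tau)   # X2 = X1 e^{j w tau}


# Three strong blocks from the source of interest (TDOA -5) and four weak
# blocks from an interferer (TDOA +5); low frequencies avoid phase wrapping.
blocks = [block(w, -5.0, 1.0) for w in (0.1, 0.2, 0.3)]
blocks += [block(w, 5.0, 0.2) for w in (0.15, 0.25, 0.35, 0.45)]
centers = list(range(-10, 11))
peaks = {}
for kind in ("phat", "ucc"):
    hist = conditional_tfh(blocks, centers, 0.5, kind)
    peaks[kind] = centers[hist.index(max(hist))]
```

In this contrived case the unit increment lets the more numerous weak blocks win, while the magnitude product increment favors the few strong blocks of the source of interest.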

7. Histogram smoothing

Histograms are often noisy, due to the limited number of blocks (be it time blocks, frequency blocks, or time–frequency blocks) that are available. An example is shown in Fig. 9, where the correct TDOA of −10 samples corresponds to a sharp peak. However, there is a secondary peak at −7 samples whose histogram value is the global maximum. As a result, a simple peak selection technique would mistakenly choose the −7 samples TDOA over the correct −10 samples TDOA.

Fig. 8. A cosine phase error selector function, which is lenient on large phase errors, and a sharp rectangular phase error selector function.

Fig. 9. A noisy histogram where the correct TDOA of −10 samples cannot be obtained by taking the TDOA at the histogram maximum.

Fig. 10. A Gaussian smoothing function.

This situation can be resolved with a histogram filtering or smoothing function. By convolving the noisy histogram of Fig. 9 with a smoothing function, such as the Gaussian one in Fig. 10, it is possible to estimate the correct TDOA much more robustly. For example, the convolution of Figs. 9 and 10, whose result is shown in Fig. 11, now has a single dominant peak at −10 samples. This histogram smoothing operation is performed in all the results that are discussed in this paper. The main reason for smoothing the histograms, versus using a larger bin size, is that the histogram smoothing can result in a higher accuracy than a wide-bin histogram. Alternatively, an iterative histogram technique where the bin sizes gradually decrease could also be used.
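The smoothing step can be sketched as a convolution of the histogram with a truncated Gaussian kernel. The kernel width, the truncation radius, and the histogram values below are invented for illustration (the paper does not state the parameters it used); they are chosen so that an isolated spike is the raw maximum while a broader cluster wins after smoothing.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised, truncated Gaussian with 2 * radius + 1 taps."""
    taps = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def smooth(hist, kernel):
    """Convolve hist with a symmetric kernel, treating out-of-range bins as 0."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(hist)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(hist):
                acc += w * hist[j]
        out.append(acc)
    return out


centers = list(range(-15, 16))
counts = {-11: 4.0, -10: 5.0, -9: 4.0, -7: 6.0}   # hypothetical noisy histogram
hist = [counts.get(c, 0.0) for c in centers]
raw_peak = centers[hist.index(max(hist))]
smoothed = smooth(hist, gaussian_kernel(sigma=1.0, radius=3))
smooth_peak = centers[smoothed.index(max(smoothed))]
```

Here the raw maximum sits on the isolated spike at −7, whereas the smoothed histogram peaks at −10, where most of the mass is concentrated.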

8. Simulation results

In order to assess the performance of the proposed TFHs, several simulations were performed. All of the simulations used 100 speech segments, each 500 ms long, obtained from a male speaker sampled at 44.1 kHz. For the histogram-based techniques, the 500 ms segments were sectioned into 100 sub-segments of 20 ms each, with 15 ms overlapping sections. For the standard cross-correlation techniques, the generalized cross-correlation was performed on the entire 500 ms (using 20 ms sub-segments for the cross-power spectral density and the power spectral density estimation). In all cases, before the processing of each sub-segment, a Hanning window was used as the windowing function.
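The sectioning and windowing step might look like the sketch below, which splits a signal into overlapping frames and applies a Hann ("Hanning") window to each. The function names and the frame sizes are illustrative assumptions; at 44.1 kHz a 20 ms frame would be 882 samples, and the 5 ms advance implied by a 15 ms overlap would have to be rounded to a whole number of samples, a detail the text does not specify.

```python
import math


def hann(n):
    """Symmetric Hann ("Hanning") window of length n >= 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_frames(signal, frame_len, hop):
    """Split signal into overlapping frames and apply a Hann window to each."""
    window = hann(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], window)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Illustrative sizes only: 200-sample frames advanced by 50 samples, i.e. the
# same 75% overlap as 20 ms frames with 15 ms of overlap.
frames = windowed_frames([1.0] * 1000, frame_len=200, hop=50)
```

Each returned frame can then be transformed and split into frequency blocks to feed a temporal or time–frequency histogram.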

In the simulations, a 15 cm inter-microphone distance was used and the main speaker was placed as shown in Fig. 12. After the microphone signals were received, independent white Gaussian noise was added to each channel. For each speech segment and at each SNR, the simulation was repeated 10 times in order to average out the randomness in the Gaussian noise signals.

The DOA root-mean-square error (RMSE) measure

in degrees was used to identify the relative error of each

Fig. 11. The result of the convolution of the noisy histogram of Fig. 9

with the histogram smoothing function of Fig. 10. Now, the correct

TDOA of )10 samples can be easily selected from the smoothed his-

togram.

Fig. 12. The setup for the first set of simulations.

Table 1
The root-mean-square DOA error (in degrees) for all of the techniques at SNRs from 0 to 20 dB

SNR       PHAT      ML        TH-UCC    TH-PHAT   TFH-PHAT  TFH-UCC   TFH-ML
0 dB      8.1497    2.5553    7.5568    11.8379   6.5897    4.4889    11.2817
2 dB      4.6862    2.0556    5.383     8.6119    4.6874    2.478     9.5758
4 dB      4.0377    1.4581    6.0438    9.7923    3.1758    2.4596    10.082
6 dB      3.0091    0.8297    5.3288    5.7857    4.2217    1.1882    8.5638
8 dB      2.5254    0.6013    0.9072    5.5829    2.7175    1.1878    9.2349
10 dB     1.6959    0.4731    0.8735    4.0009    2.6096    0.9382    4.8139
12 dB     1.4912    0.2719    0.8492    1.4289    1.2295    0.8484    6.3338
14 dB     1.3637    0.1997    3.7948    1.189     0.8217    0.8968    7.9343
16 dB     1.1646    0.1287    0.7685    1.1364    0.9488    0.7601    6.5276
18 dB     0.9072    0.0922    0.7783    1.038     0.9756    0.7925    4.0922
20 dB     0.8025    0.0582    0.703     0.9591    0.8257    0.7991    0.931

118 P. Aarabi, S. Mavandadi / Information Fusion 4 (2003) 111–122


of the techniques at a given SNR. This error can be obtained from the estimated time delay τ as follows:

\[ \Delta\theta = \arcsin\left(\frac{\tau v}{d}\right) - 30^{\circ} \tag{26} \]

where v is the speed of sound in air (about 345 m/s), d is the inter-microphone distance (in this case 0.15 m), and
Δθ is the error in the DOA. By obtaining many different

DOA errors for the different 500 ms segments and for

different random noises that were added to the signals of

each channel, an overall root-mean-square DOA error

was obtained for each technique at each given SNR. The

results for all of the techniques are shown in Table 1.
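The error measure of Eq. (26) and the resulting RMSE can be sketched in Python as follows. The constants come from the text (v of about 345 m/s, d = 0.15 m, a 30° source direction); the function names, the clamping of the arcsine argument, and the assumption that the time delay is expressed in seconds are choices made for this example.

```python
import math

SPEED_OF_SOUND = 345.0   # m/s, value quoted in the text
MIC_DISTANCE = 0.15      # m, inter-microphone distance
TRUE_DOA_DEG = 30.0      # degrees, source direction appearing in Eq. (26)


def doa_error_deg(tau_seconds):
    """DOA error in degrees for an estimated time delay given in seconds."""
    ratio = tau_seconds * SPEED_OF_SOUND / MIC_DISTANCE
    # Assumption: clamp so that noisy delays beyond d/v do not break asin.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio)) - TRUE_DOA_DEG


def rmse(errors):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# The delay that corresponds exactly to a 30 degree source gives zero error.
true_tau = MIC_DISTANCE * math.sin(math.radians(TRUE_DOA_DEG)) / SPEED_OF_SOUND
print(doa_error_deg(true_tau))

# Hypothetical delay estimates (in seconds) from several trials at one SNR.
estimates = [true_tau, true_tau + 2.0e-5, true_tau - 2.0e-5]
print(rmse([doa_error_deg(t) for t in estimates]))
```

In the simulations, the list passed to `rmse` would hold one error per speech segment and noise realization for a given technique and SNR.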

Clearly, the most accurate TDOA estimation technique at all SNRs is the ML based generalized cross-correlation technique followed by the TFH technique

with the UCC incrementing function (TFH-UCC),

which performs better than the other techniques (except

the ML based cross-correlation) at lower SNRs. The

best TFH technique (TFH-UCC) is compared to the

traditional cross-correlation techniques in Fig. 13.

As shown in Fig. 13, for low SNRs (<10 dB) the TFH-UCC technique has an RMSE that is half that of

the next best cross-correlation based TDOA estimation

technique. For higher SNRs, cross-correlation based

techniques have a smaller RMSE, partly because of the

discretization introduced by the finite-width of the his-

togram bins. In Fig. 13, the PHAT, ML, and UCC

techniques involve performing a single cross-correlation

with the corresponding weighting function on the entire 500 ms speech segment. The TFH-UCC technique

obtains time-delay estimates from the time–frequency

blocks of the large speech segment and then utilizes a

histogram in order to fuse the obtained results.

Fig. 14 compares the TFH-UCC technique with other

time-histograms. Here, the TH-UCC technique involves

Fig. 13. Comparison of traditional cross-correlation based TDOA

estimators along with the TFH-UCC histogram technique.

Fig. 14. Root-mean-square error of time-histogram techniques and the

TFH-UCC technique.

Fig. 15. Root-mean-square errors for three different time–frequency

histogram techniques.

Fig. 16. A setup with two speakers used for the simulations of Figs. 17–19.


performing unfiltered cross-correlations on 20 ms speech

sub-segments and incorporating the results of all TDOA

estimates in a 500 ms speech segment using a histogram.

The TH-PHAT is identical except that the cross-correlation
uses a PHAT weighting. As shown, the TFH is
able to match the performance of the time-histograms at high SNRs (about 1° RMSE for SNRs higher than 8 dB)

while providing a much lower RMSE at lower SNRs.
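The time-histogram idea described above (one unfiltered cross-correlation per sub-segment, with the per-segment TDOA estimates fused by a histogram) can be sketched in Python as follows. This is a toy sketch, not the implementation used in the simulations: the function names, the sign convention for the lag, the white-noise source, and the small frame sizes are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def best_lag(x, y, max_lag):
    """Lag (in samples) that maximizes the plain cross-correlation sum x[n] * y[n + lag]."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def time_histogram_tdoa(x, y, frame_len, hop, max_lag):
    """Estimate one lag per sub-segment and fuse the estimates with a unit-increment histogram."""
    votes = Counter()
    for start in range(0, len(x) - frame_len + 1, hop):
        xs = x[start:start + frame_len]
        ys = y[start:start + frame_len]
        votes[best_lag(xs, ys, max_lag)] += 1
    estimate = votes.most_common(1)[0][0]  # histogram peak
    return estimate, votes


rng = random.Random(1)  # fixed seed so the example is repeatable
true_delay = 7          # samples; the second channel lags the first
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic1 = [s + rng.gauss(0.0, 0.5) for s in source]
delayed = [0.0] * true_delay + source[:-true_delay]
mic2 = [s + rng.gauss(0.0, 0.5) for s in delayed]

estimate, votes = time_histogram_tdoa(mic1, mic2, frame_len=200, hop=50, max_lag=20)
print(estimate, sum(votes.values()))
```

A PHAT-weighted variant would replace `best_lag` with a cross-correlation computed from the phase-normalized cross-spectrum; the histogram fusion step would stay the same.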

Finally, Fig. 15 illustrates the RMSE for all three

different TFH techniques. These techniques correspond

to TFHs that have incrementing functions defined by

Eqs. (23)–(25). As shown, the TFH-UCC has a lower

RMSE than the other TFH techniques. In fact, the TFH

with the ML weighting function results in a very large RMSE. All the techniques achieve similar RMSEs at

larger SNRs.

Another simulation was performed but now, instead

of independent additive Gaussian noise signals, a second

male speaker was placed in the environment. The setup

of this new simulation is shown in Fig. 16.

The RMSE results corresponding to this second

simulation are shown in Table 2. As shown, the TFH-

UCC technique again performs better at sub-10 dB

SNRs, while the ML based cross-correlation performs

very well at SNRs of 10 dB or higher.

Fig. 17 compares the standard single-segment cross-correlation techniques with the TFH-UCC technique.

As shown, for SNRs less than 10 dB the TFH-UCC

technique outperforms the other techniques. For larger

SNRs, the ML based generalized cross-correlation

is able to obtain a zero RMSE, outperforming all other

techniques.

Fig. 18 compares the TFH-UCC technique with the time-histogram techniques. As before, at lower SNRs

the TFH technique performs much better than its time-

histogram counterparts. At high SNRs (>16 dB), the
time-histogram technique with the PHAT-based segment
cross-correlations (TH-PHAT) is considerably more accurate
than the other histogram techniques.

Finally, we compare the three TFH techniques. As

shown in Fig. 19, the TFH-UCC performs better than all other TFH techniques at all SNRs greater than 0 dB.

This performance difference is especially significant in

the 2–16 dB range.

Table 2
The RMSE results (DOA error in degrees) corresponding to the simulation of Fig. 16

SNR       PHAT      ML        TH-UCC    TH-PHAT   TFH-PHAT  TFH-UCC   TFH-ML
0 dB      46.8181   45.3314   40.9209   39.4357   37.8257   40.5569   36.7308
2 dB      35.1136   35.1136   32.1155   37.317    36.7584   20.6388   35.7137
4 dB      23.4090   23.4090   31.1306   29.8062   32.1041   14.0993   30.9495
6 dB      20.2728   16.5527   25.2178   25.3922   28.1673   9.8807    26.8707
8 dB      16.5527   11.7045   21.6577   21.7975   25.2252   8.9426    26.7301
10 dB     16.5527   0.0000    21.8022   19.8171   25.2267   8.9067    23.9229
12 dB     11.7045   0.0000    17.6957   17.8486   21.8545   8.8853    23.6024
14 dB     11.7045   0.0000    19.7882   15.4871   17.8697   8.8739    19.9271
16 dB     0.0000    0.0000    17.7653   9.018     15.4561   8.8684    15.6338
18 dB     0.0000    0.0000    12.7298   0.7117    12.5879   8.8589    15.4253
20 dB     0.0000    0.0000    15.6437   0.6558    12.5872   8.855     12.5521

Fig. 17. Comparison of standard cross-correlation techniques and the

TFH-UCC technique.

Fig. 18. Comparison of the TFH-UCC with the time-histogram tech-

niques.


9. Conclusions

This paper proposed a weighted TFH technique for

the estimation of time-delays of arrival between two

microphones. This technique is less lenient on the time-delays
of each time–frequency block than traditional cross-correlation based techniques, having a sharp

rectangular selector function as compared to a cosine

selector function of the cross-correlations. As a result,

for low SNRs with speech, the proposed time–frequency

technique is able to achieve a lower RMSE when com-

pared to other cross-correlation or time-histogram

techniques. For Gaussian noise, and for speech noise at

SNRs greater than or equal to 10 dB, the ML based generalized cross-correlation technique performed better

than all other techniques.

The simulations showed that for sub-10 dB SNRs, the

TFH technique with the TFH-UCC incrementing

function of Eq. (24) had the best results when compared

to the other histogram techniques. It is interesting to

note that the TFH-PHAT technique, which consists of a

TFH with a unit incrementing function, performs worse than the TFH-UCC technique at almost all SNRs. This

allows us to conclude that the equal treatment of all

time–frequency blocks adversely affects the outcome of

the time-delay estimation process.

This paper did not analyze the effects of reverbera-

tions. It is presumed that in reverberant environments,

extensions such as that proposed in [17] can be effective

at making the histograms robust. This issue was not explored in this paper, but is an important problem that

requires further analysis in subsequent work. Also, it

remains unclear how the method of [18] would compare

to the TFH method proposed here. While [18] did not

utilize direct TFHs, it did utilize a model based on the

precedence effect and time–frequency smoothing. Hence,

a combination of the two techniques, using the model of [18] and the time–frequency data fusion proposed here,

may result in more accurate TDOA estimates.

One remaining issue is the shape of the selector

function. It was shown here that the TFH technique and

the cross-correlation techniques are in fact equivalent

except that they have different selector functions.

However, the shape of this function requires further

analysis, perhaps resulting in an alternate shape that can outperform both the TFH and the cross-correlation

techniques.

References

[1] P. Aarabi, Robust sound localization using the formant phase

transform, in: Proceedings of the 5th IEEE Workshop on

Nonlinear Signal and Image Processing (NSIP'01), Baltimore,

Maryland, June 2001.

[2] P. Aarabi, Multi-sense Artificial Awareness, M.A.Sc. Thesis,

Department of Electrical and Computer Engineering, University

of Toronto, June 1999.

[3] P. Aarabi, S. Zaky, Iterative spatial probability based sound

localization, in: The Proceedings of the 4th World Multi-

conference on Circuits, Systems, Computers, and Communica-

tions, Athens, Greece, July 2000.

[4] M.S. Brandstein, J. Adcock, H. Silverman, A practical time-delay

estimator for localizing speech sources with a microphone array,

Computer, Speech, and Language 9 (1995) 153–169.

[5] M.S. Brandstein, H. Silverman, A practical methodology for

speech source localization with microphone arrays, Computer,

Speech, and Language 11 (2) (1997) 91–126.

[6] M.S. Brandstein, H. Silverman, A robust method for speech signal

time-delay estimation in reverberant rooms, in: Proceedings of the

IEEE Conference on Acoustics, Speech, and Signal Processing,

Munich, Germany, April 1997.

[7] K. Guentchev, J. Weng, Learning-based three dimensional sound

localization using a compact non-coplanar array of microphones,

in: Proceedings of the AAAI Symposium on Intelligent Environ-

ments, 1998.

[8] C. Knapp, G. Carter, The generalized correlation method for

estimation of time delay, IEEE Transactions on Acoustics,

Speech, and Signal Processing ASSP-24 (4) (1976) 320–327.

[9] P. Aarabi, Robust speech localization using temporal power

fusion, in: Proceedings of Sensor Fusion: Architectures, Algorithms,
and Applications V (SPIE AeroSense'01), Orlando,

Florida, April 2001.

[10] J. Huang, N. Ohnishi, X. Guo, N. Sugie, Echo avoidance in a

computational model of the Precedence effect, Speech Communi-

cation 27 (1999) 223–233.

[11] M.S. Datum, F. Palmieri, A. Moise, An artificial neural network

for sound localization using binaural cues, Journal of the

Acoustical Society of America 100 (1) (1996) 372–383.

[12] J. Backman, M. Karjalainen, Modelling of human directional and

spatial hearing using neural networks, in: Proceedings of the IEEE

International Conference on Acoustics, Speech, and Signal

Processing, Minneapolis, MN, April 1993.

[13] J. Huang, T. Supaongprapa, I. Terakura, F. Wang, N. Ohnishi, N.

Sugie, A model based sound localization system and its applica-

tion to robot navigation, Robotics and Autonomous Systems

27 (4) (1999) 199–209.

[14] R.O. Schmidt, Multiple emitter location and signal parameter

estimation, IEEE Transactions on Antennas and Propagation AP-

34 (3) (1986) 276–280.

Fig. 19. Comparison of the three different time–frequency histogram

techniques.


[15] H. Watanabe, M. Suzuki, N. Nagai, N. Miki, A method for

maximum likelihood bearing estimation without nonlinear

maximization, Transactions of the Institute of Electronics, Infor-

mation and Communication Engineers J72A (8) (1989) 303–

308.

[16] P. Aarabi, K. Mohajer, Post recognition speech localization,

International Journal of Speech Technology, submitted.

[17] N. Strobel, R. Rabenstein, Classification of time delay estimates

for robust speaker localization, in: Proceedings of the IEEE

International Conference on Acoustics, Speech, and Signal

Processing, Phoenix, Arizona, March 1999.

[18] J. Huang, N. Ohnishi, N. Sugie, Sound localization in reverberant

environment based on the model of the Precedence effect, IEEE

Transactions on Instrumentation and Measurement 46 (4) (1997)

842–846.

[19] P. Aarabi, G. Shi, Multi-channel time–frequency data fusion, in:

Proceedings of the 5th International Conference on Information

Fusion, Annapolis, Maryland, July 2002.
