Robust sound localization using conditional time–frequency histograms
Parham Aarabi *, Sam Mavandadi
Department of Electrical and Computer Engineering, University of Toronto, 10 Kings College Rd., Toronto, Ont., Canada M5S 3G4
Received 17 September 2001; received in revised form 3 August 2002; accepted 19 September 2002
Abstract
A mechanism for robust time-delay of arrival (TDOA) estimation is proposed using time–frequency histograms (TFHs). By
dividing the signals of two different microphones into time–frequency blocks, TDOA estimates are obtained from the block phase
information and fused using a histogram. It is shown that TFHs are related to the traditional cross-correlation techniques except
that they are less lenient on large phase errors. It is also shown that a weighted or conditional TFH technique can achieve better results
than a traditional histogram with unit incrementing functions. Conditional TFHs are explored by the introduction of three histogram
incrementing functions, among which the magnitude product incrementing function (TFH-UCC) results in the smallest
direction-of-arrival root-mean-square errors (RMSE) in the 0–10 dB signal-to-noise ratio (SNR) range when the noise consists of a
second speaker. For Gaussian noise, or for high SNR (≥ 10 dB) speech noise, the generalized cross-correlation with the maximum-likelihood
weighting function has the lowest RMSE.
© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Microphone array; Information fusion; Time–frequency analysis; Histogram fusion; Array signal processing
1. Introduction
In many applications, the localization of a sound
source is required. For example, in order to point a
camera toward a speaker in a lecture hall, or to form a beam pattern in order to filter the unwanted noises and
focus on the speaker of interest, the location of the
speaker must first be obtained.
Sound localization may be achieved by means of an
array of microphones [2–7,9]. While different techniques
such as MUSIC [14] and maximum likelihood (ML) [15]
exist, most typical sound localization systems utilize
(due to the complexity and the requirements of MUSIC and ML) the time-delay of arrival (TDOA) between
different microphones. A single TDOA constrains the
possible sound source locations to a hyperbola in 2-D,
or a hyperboloid in 3-D. By obtaining TDOA estimates
from multiple microphone pairs, multiple hyperbolae
are obtained whose intersection in space corresponds to
the sound source position. Usually, since TDOA esti-
mates may be noisy, there can often be several hyper-
bolae intersections. However, these intersections usually
occur in the vicinity of the correct source position,
and hence can be used as a means of localizing the
source positions.
In order to obtain an estimate for the TDOA(s) of different sources at two different microphones, a variety
of techniques can be used. Examples include the gener-
alized cross-correlation technique [1,6], a variety of
histogram techniques [3,9,10,13], and techniques based
on neural networks [11,12]. Generally, cross-correlation
based techniques are better suited to single-source situations
where high estimation accuracy is required [2–6,9].
In contrast, histogram techniques, which often use cross-correlations on a variety of short signals in order
to obtain multiple TDOA estimates, can deal with a
larger number of sound or noise sources at a cost of
lower accuracy [2,3,9]. Other techniques such as neural
network based [7] or zero-crossing techniques also exist
[10,13], but are often not used because of their relative
inaccuracy and/or the increased computational cost.
However, in practical environments, the robustness of histograms is often preferred to the accuracy of less
robust techniques [2,17,18]. As a result, several inter-
esting histogram algorithms have been proposed in the
past few years, including iterative histogram generation
*Corresponding author.
E-mail addresses: [email protected] (P. Aarabi), mavanda@
ecf.utoronto.ca (S. Mavandadi).
1566-2535/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved.
doi:10.1016/S1566-2535(03)00003-4
Information Fusion 4 (2003) 111–122
www.elsevier.com/locate/inffus
[2], histogram-based classification [17], and restricted histograms [18]. The technique proposed by [17] trained
histograms for different directions of arrival (DOAs) of
the source and used them to classify the corresponding
DOA of a sound segment. Two techniques, including
histogram matching and ML, were proposed. The his-
togram matching resulted in the most accurate sound
localization system, although it required 2 s per localization.
In this paper we extend the idea of histogram-based
TDOA estimators to time–frequency histograms (TFHs).
The goal here is twofold. First, in situations where dif-
ferent sources are not temporally or spectrally overlap-
ping, TFHs would result in histograms that distinguish
between the different sources. Second, by computing
histograms using time–frequency blocks instead of just
time blocks, a much greater number of blocks become available, resulting in greater robustness.
2. Single segment TDOA estimation using cross-correlation
There are many different algorithms that attempt to
estimate the most likely TDOA between a pair of observers [4–8]. Usually, these algorithms have a heuristic
measure that estimates the likelihood of each TDOA
and then selects the most likely one.
A simple model of the signal received by two micro-
phones can be defined as [6]:
$$x_1(t) = h_1(t) \ast s(t) + n_1(t) \tag{1}$$
$$x_2(t) = h_2(t) \ast s(t + \tau) + n_2(t) \tag{2}$$
Basically, both microphones receive a time-delayed
version of the source signal $s(t)$, through channels with
possibly different impulse responses $h_1(t)$ and $h_2(t)$, and
with microphone dependent noise signals $n_1(t)$ and $n_2(t)$.
Generally, the noise sources can be assumed to be uncorrelated,
and the impulse responses are assumed to be
fixed for given microphone array and speaker positions.
However, these assumptions will not always be true, and
as a result, techniques have been developed that can to a
certain extent tolerate the failure of these assumptions
[6]. The main problem is to estimate $\tau$ given the microphone
signals $x_1(t)$ and $x_2(t)$. The most common solution
to this problem is the single segment generalized
cross-correlation approach shown below [8]:
$$\hat{\tau} = \arg\max_{\beta} \int_{-\infty}^{\infty} W(\omega) X_1(\omega) \overline{X_2(\omega)} e^{j\omega\beta} \, d\omega \tag{3}$$
where $\hat{\tau}$ is an estimate of the original source signal delay
between the two microphones and the overline denotes complex conjugation. The actual choice of the
weighting function $W(\omega)$ has been studied in detail for
general sound and speech sources, and three different
choices, the ML [6,7], the phase transform [6,8], and the
unfiltered cross-correlation [7,8] are shown below:
$$W_{\mathrm{ML}}(\omega) = \frac{|X_1(\omega)||X_2(\omega)|}{|N_1(\omega)|^2 |X_2(\omega)|^2 + |X_1(\omega)|^2 |N_2(\omega)|^2} \tag{4}$$
$$W_{\mathrm{PHAT}}(\omega) = \frac{1}{|X_1(\omega) X_2(\omega)|} \tag{5}$$
$$W_{\mathrm{UCC}}(\omega) = 1 \tag{6}$$
The ML weights require knowledge about the spectrum
of the microphone dependent noises. The phase transform
does not require this knowledge, and hence has
been employed more often due to its simplicity. The
unfiltered cross-correlation does not utilize any weighting
function.
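The estimator of Eq. (3) can be illustrated with a minimal sketch, assuming discrete-time signals, a naive DFT in place of the integral, and integer candidate delays. The function names (`dft`, `gcc_delay`), the test signal, and the small constant guarding the PHAT division are illustrative assumptions and are not part of the original method. The ML weighting of Eq. (4) is omitted because it needs noise spectra.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for a short sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_delay(x1, x2, max_lag, weighting="ucc"):
    """Integer lag beta that maximises a discrete-frequency form of Eq. (3)."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    best_lag, best_score = 0, -math.inf
    for beta in range(-max_lag, max_lag + 1):
        score = 0.0
        for k in range(n):
            cross = X1[k] * X2[k].conjugate()
            # PHAT divides out the magnitude (Eq. (5)); UCC uses W = 1 (Eq. (6)).
            # The 1e-12 term is an assumption that avoids division by zero.
            w = 1.0 / (abs(cross) + 1e-12) if weighting == "phat" else 1.0
            score += (w * cross * cmath.exp(2j * math.pi * k * beta / n)).real
        if score > best_score:
            best_lag, best_score = beta, score
    return best_lag

# Hypothetical test signal: x2 is x1 advanced by 3 samples, i.e. s(t + tau) with tau = 3.
s = [0.0] * 32
s[5], s[6] = 1.0, 0.5
x1 = s
x2 = [s[(t + 3) % 32] for t in range(32)]
print(gcc_delay(x1, x2, max_lag=5, weighting="ucc"))
print(gcc_delay(x1, x2, max_lag=5, weighting="phat"))
```

Under these assumptions (a noise-free, circularly shifted signal) both weightings should recover the 3-sample delay. With noise or a second source the two weightings can disagree, which is the situation the later sections address.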
3. Multi-segment TDOA estimation using the cross-correlation
In situations where a large speech segment has been
recorded by two microphones, the standard version of
the generalized cross-correlation can be applied [8]. It
should be noted that the version of the generalized cross-correlation
presented in Eqs. (3), (5), (6) was only a
single segment approximation of the generalized cross-correlation
technique shown below:
$$\hat{\tau} = \arg\max_{\beta} \int_{-\infty}^{\infty} \hat{W}(\omega) \hat{G}_{X_1 X_2}(\omega) e^{j\omega\beta} \, d\omega \tag{7}$$
where $\hat{W}(\omega)$ is a weighting function obtained from the
estimated cross-power spectral density and the estimated
power spectral density and $\hat{G}_{X_1 X_2}(\omega)$ is the estimated
cross-power spectral density of the signals $X_1(\omega)$ and
$X_2(\omega)$. Note that by estimated cross-power spectral
density, it is meant that the signals are divided into small
segments (typically 20 ms long), with each segment
being processed by a Hanning window, and the average
of the cross-power spectral densities of all the segments
is then taken and used as the estimated cross-power
spectral density. A similar procedure was performed for
estimating the power spectral density. The ML and the
PHAT weighting functions corresponding to this case
are defined as [8]:
$$\hat{W}_{\mathrm{ML}}(\omega) = \frac{|\hat{G}_{X_1 X_2}(\omega)|}{|\hat{G}_{X_1 X_1}(\omega) \hat{G}_{X_2 X_2}(\omega)| - |\hat{G}_{X_1 X_2}(\omega)|^2} \tag{8}$$
$$\hat{W}_{\mathrm{PHAT}}(\omega) = \frac{1}{|\hat{G}_{X_1 X_2}(\omega)|} \tag{9}$$
where $\hat{G}_{X_1 X_1}(\omega)$ and $\hat{G}_{X_2 X_2}(\omega)$ are the estimated power
spectral densities of $X_1(\omega)$ and $X_2(\omega)$, respectively.
In this paper, when we compare the generalized cross-correlation
techniques for long segments, we will utilize
the method presented here. For single segment estimates,
such as those required for histograms, the single
segment methods of the previous section will be used.
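The averaging procedure described above can be sketched as follows. This is a minimal illustration, assuming a naive DFT, a symmetric Hann window, and segment lengths given in samples; `averaged_cross_psd`, `phat_weights`, the hop size, and the `eps` guard are hypothetical names and choices rather than the authors' implementation.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def averaged_cross_psd(x1, x2, seg_len, hop):
    """Average X1 * conj(X2) over Hann-windowed segments of length seg_len."""
    win = hann(seg_len)
    acc = [0j] * seg_len
    count = 0
    for start in range(0, len(x1) - seg_len + 1, hop):
        X1 = dft([w * v for w, v in zip(win, x1[start:start + seg_len])])
        X2 = dft([w * v for w, v in zip(win, x2[start:start + seg_len])])
        for k in range(seg_len):
            acc[k] += X1[k] * X2[k].conjugate()
        count += 1
    return [a / count for a in acc]

def phat_weights(g12, eps=1e-12):
    """PHAT weighting of Eq. (9); eps is an assumed guard against empty bins."""
    return [1.0 / (abs(g) + eps) for g in g12]

# Hypothetical usage: the power spectral density is the cross-PSD of a signal with itself.
x = [math.sin(0.9 * t) + 0.3 * math.cos(0.2 * t) for t in range(32)]
g11 = averaged_cross_psd(x, x, seg_len=8, hop=4)
print(len(g11))
```

Calling `averaged_cross_psd(x, x, ...)` gives the estimated power spectral density, so the same helper can supply every term needed for Eqs. (8) and (9).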
4. TDOA fusion using histograms
Often, due to the presence of multiple sound sources,
it is necessary to obtain both the number and the like-
lihood of different TDOAs by estimating their proba-
bility distribution function. In practical situations, the
simplest method of estimating the probability distribu-
tion is to obtain multiple TDOA estimates that can be
fused using a histogram. In the limit, as the number of TDOA estimates approaches infinity (while the sound
sources remain stationary), the normalized TDOA his-
togram will approach the distribution of the TDOAs.
Assuming that a TDOA histogram bin is centered at
TDOA $\tau$, has width $2\epsilon$, and a total of $N$ TDOA estimates
are available with TDOA result $\tau_i$ for the $i$th estimate,
the histogram bin $h(\tau)$ can be defined as:
$$h(\tau) = \frac{1}{N} \sum_{i=1}^{N} \int_{\tau-\epsilon}^{\tau+\epsilon} \delta(t - \tau_i) \, dt \tag{10}$$
By the law of large numbers, assuming $N$ is large, the
summation over $i$ in the limit becomes equivalent to the
expectation, which, once simplified, reduces to:
$$h(\tau) \approx \int_{\tau-\epsilon}^{\tau+\epsilon} f_{\mathrm{TDOA}}(t) \, dt \tag{11}$$
where $f_{\mathrm{TDOA}}(t)$ is the probability density function of the
TDOAs. In the limit, as $\epsilon$ approaches 0, and as long as
the histograms are normalized (constant scaling of the
entire histogram), the histogram will be proportional to
the probability density function of the TDOAs.
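Eq. (10) amounts to counting the estimates that fall inside each bin and dividing by $N$. A minimal sketch, assuming half-open bins so that an estimate on a shared edge is counted once (the function name and the example values are hypothetical):

```python
def tdoa_histogram(estimates, centers, eps):
    """h(tau) of Eq. (10): fraction of estimates within eps of each bin centre.

    Bins are treated as half-open intervals [c - eps, c + eps); this is an
    assumption made here to avoid double counting at shared bin edges.
    """
    n = len(estimates)
    return [sum(1 for tau_i in estimates if c - eps <= tau_i < c + eps) / n
            for c in centers]

# Hypothetical TDOA estimates in samples: three near -5, one near 5, one near 0.
centers = list(range(-6, 7))
h = tdoa_histogram([-5, -5, -5, 5, 0.2], centers, eps=0.5)
print(centers[h.index(max(h))])
```

The peak bin of this normalized histogram is the estimate that the fusion step would select.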
4.1. Frequency histograms
One method of obtaining multiple TDOA estimates
and then combining them in a histogram is by using a
frequency-band-selected TDOA histogram [10,13]. This approach divides up the spectrum of the signal into
multiple frequency bands with each band selecting a
single TDOA for histogram fusion. In order to obtain a
histogram that is representative of the distribution of the
TDOAs, many frequency bands are required (Fig. 1).
Because the phase of each of the frequency blocks
wraps at $2\pi$, depending on the time-delay and the frequency
of the block, it is sometimes impossible to obtain
a single TDOA estimate. This is more problematic at
higher frequencies, where the phase wrapping phenomenon
can affect even small TDOAs. In these situations,
we can extract all possible TDOAs so that instead of
having a single selection for each frequency block, we
have multiple TDOAs, only one of which is correct. This
involves utilizing all of the peaks of the correlation
function where for each peak the corresponding TDOA
histogram bin is incremented. This can be viewed as a
histogram aliasing phenomenon.
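The aliasing effect can be made concrete with a small sketch that lists every delay consistent with one wrapped phase difference. It assumes the sign convention used later in the paper (phase error $\angle X_1 - \angle X_2 + \omega\beta$), a delay in seconds, and a physical bound on the delay; the function name and the numerical values are illustrative only.

```python
import math

def candidate_tdoas(phase_diff, freq_hz, max_delay):
    """All delays beta (s) with phase_diff + omega*beta = 2*pi*k and |beta| <= max_delay."""
    omega = 2 * math.pi * freq_hz
    k_min = math.ceil((phase_diff - omega * max_delay) / (2 * math.pi))
    k_max = math.floor((phase_diff + omega * max_delay) / (2 * math.pi))
    return [(2 * math.pi * k - phase_diff) / omega for k in range(k_min, k_max + 1)]

# Assumed geometry: 15 cm spacing and 345 m/s speed of sound bound the delay.
max_delay = 0.15 / 345.0
print(len(candidate_tdoas(0.5, 500.0, max_delay)))   # low frequency
print(len(candidate_tdoas(0.5, 5000.0, max_delay)))  # high frequency
```

At the lower frequency only one delay fits inside the physical bound, while at the higher frequency several do, which is why each candidate would increment its own histogram bin.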
Frequency histograms are effective when different
sound sources have mutually exclusive frequency components [10,13]. However, when several sources have the
same spectral contents, then frequency histograms are
often incapable of estimating the TDOAs of all sound
sources. For this reason, frequency histograms are
generally not used in practice.
However, frequency related techniques such as the
one proposed in [18], which basically proposed a multi-frequency
onset histogram with certain restrictions, have
been shown to be able to localize sound source directions
to within a few degrees. Nevertheless, by only
utilizing single frequency bands, they have less potential
for resolving multiple speakers than TFHs. It should be
noted that while the technique of [18] did compute
frequency histograms, it combined multiple frequency
histograms and sequential time segments by using a
Gaussian smoothing function. Hence, in effect, [18] did
combine time–frequency information, although the histograms
were computed across frequencies only.
4.2. Temporal histograms
A more commonly used histogram technique is tem-
poral histograms, also known as iterative spatial prob-
ability [3] or temporal power fusion [9] based TDOA
estimation. The idea with temporal histograms is that, in
the case of multiple speech sources, not all sources will
be simultaneously active. As a result, the periods during which a source is silent are used to collect information
regarding the other sound sources, and vice versa. This
is accomplished by temporally breaking up the speech
Fig. 1. TDOA estimates corresponding to different frequency blocks
are incorporated into a TDOA histogram in order to approximate the
TDOA distribution.
signals into small segments (typically 20 ms long) whose
TDOA estimates (obtained by a weighted cross-corre-
lation) are incorporated into a histogram, as shown in
Fig. 2. By obtaining many speech segments, numerous
TDOA estimates become available, resulting in a histogram that is representative of the TDOA distribution.
In order to analyze the effectiveness of the temporal
histogram TDOA estimation technique, a simulation
with two male speakers was performed. Assuming that
one of the sources is considered to be the signal of
interest (with a TDOA of −5 samples using a 20 kHz
sampling rate) and the other considered as noise (TDOA
of 5 samples), the signal-to-noise ratio (SNR) was varied
from 30 to −30 dB, with the normalized temporal histogram
corresponding to 800 ms speech segments plotted
in Fig. 3. As shown, above 0 dB the histogram has a
dominant peak at the signal TDOA (−5) and below 0 dB
the histogram shows a dominant peak at the noise
TDOA (5).
Time-histograms are often used successfully in practice.
One interesting implementation was that of [17]. In
[17] a histogram training phase was proposed, where, for
a finite number of directions of arrival, a TDOA histogram
was produced (using multiple time segments from
a long sound signal segment). The resulting histograms
were then compared using a chi-squared metric to all the
histograms in the database for the different DOAs. The
DOA corresponding to the most similar histogram
would then be identified as the DOA of the segment.
This DOA estimation strategy results in a slow localization
rate (0.5 Hz). An alternative strategy was also
proposed by [17] which, by training histograms, would
compute the ML direction of arrival from a single
speech frame. This resulted in a substantial increase in
the localization rate, although the increase was at a cost
of a less accurate sound localization system.
It should be noted that the ideas proposed in [17],
which are quite effective at dealing with reverberations,
are not a new histogram technique but an extension of
time-histogram techniques. Hence, the idea of TFHs
that will be introduced in the next section can be extended
in a similar fashion to that of [17].
5. Time–frequency histograms
Having seen both temporal and frequency histograms, it becomes clear that the two techniques can be
combined to produce TFHs. This involves breaking up
the signals both temporally and spectrally, and then
incorporating the resulting TDOA of each block into a
histogram, as shown in Fig. 4.
As mentioned previously, temporal histograms work
well when multiple sound sources are not temporally
overlapping. Although speech sources usually occupy
the same general frequencies and can often be temporally
overlapping, in many cases, due to the differences
in the vowels spoken by different speakers, multiple
speech sources tend to occupy different frequency blocks
Fig. 2. TDOA estimates corresponding to different temporal blocks are
incorporated in a temporal histogram.
Fig. 3. Temporal histograms as a function of SNR. Note that white corresponds to higher probability.
at different times. Hence, TFHs should surpass the
performance of temporal histograms.
The speech signals used in the simulation of Fig. 3
were used once again in order to simulate TFHs at different SNRs. As shown in Fig. 5, the TFHs are much
cleaner than their temporal counterparts. This is mainly
because many more TDOA estimates are available in the
TFH case than the temporal case, and also because the
TFH technique allows for the spectral separation of
different speech sources unlike the temporal case. Note
that the symmetrical nature of the histograms with respect
to the SNRs is maintained in the TFH case as in the temporal case.
6. Conditional TF histograms
One issue regarding frequency and TFHs is the
amount of contribution that is made by each frequency
or each time–frequency block. With time-histograms, it is reasonable to assume that the TDOA estimate of each
block should be equally significant; therefore the con-
tributions to the histogram would be the same. In other
words, every time a TDOA estimate is obtained from a
signal segment, the corresponding TDOA bin of the
histogram is incremented by a constant value of 1.
However, should a frequency or time–frequency block
at 1 kHz be as important as one at 5 or 10 kHz? Due to the spectral characteristics of speech, a frequency dependent
histogram incrementing technique might be
more appropriate than equal contributions from all
frequencies.
6.1. TFH with binary priors
The simplest block dependent histogram increment-
ing technique is a binary prior likelihood model [1,16].
Here, either a block is used in the histogram or it is not
used. For example, it would make sense for the sound
signal whose spectrogram is shown in Fig. 6 to only use the blocks in which the main signal is dominant (darker
Fig. 4. TDOA estimates from time–frequency blocks are incorporated
into a time–frequency histogram.
Fig. 5. Time–frequency histograms as a function of SNR.
Fig. 6. Blocks used for a conditional TFH corresponding to a voiced
speech segment. The dark circles correspond to blocks used.
color), or in other words, to use a block that lies on the formants of the speech signal. This technique would be
similar to the formant phase transform that has been
utilized with success in recent years [1,16]. Of course,
this requires prior knowledge regarding the location of
the formants.
The same 800 ms sound signal that was used to
generate the SNR dependent histograms of Figs. 3 and 5
was also used to generate conditional TFH with binary
priors. Here, the formant structure (only F1 and F2) of
the speech source of interest (with a −5 sample TDOA)
is used to generate the binary block likelihoods. The
resulting SNR dependent histograms, shown in Fig. 7,
illustrate the advantage of conditional TFH, since the
entire histogram now shows a consistent peak at −5
samples, even at SNRs as low as −20 dB. Note that these
results are obtained from [16], and that a more detailed
discussion of binary priors pertaining to speech sources
appears in [16].
6.2. TFH with multi-valued priors
Although binary priors for conditional TFHs are in
some cases easy to obtain, it is usually more appropriate
to use weights that are not just 0 or 1, but can have a
range of values. Here, we modify the histogram incre-
menting function of Eq. (10) as follows:
$$h(\tau) = \frac{1}{N} \sum_{i=1}^{N} c(i) \int_{\tau-\epsilon}^{\tau+\epsilon} \delta(t - \tau_i) \, dt \tag{12}$$
where $c(i)$ is a scaling factor for the $i$th time–frequency block.
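The weighted histogram can be sketched by letting each estimate contribute its own weight instead of a constant 1. The example below is a minimal illustration with hypothetical names and values; with weights restricted to 0 or 1 it reduces to the binary-prior case of the previous subsection.

```python
def weighted_tdoa_histogram(estimates, weights, centers, eps):
    """Weighted histogram: each estimate tau_i adds its weight c(i) to its bin.

    Bins are half-open intervals [c - eps, c + eps), an assumption made here
    so that an estimate on a shared edge is counted once.
    """
    n = len(estimates)
    return [sum(c_i for tau_i, c_i in zip(estimates, weights)
                if center - eps <= tau_i < center + eps) / n
            for center in centers]

# Hypothetical blocks: two from the source of interest (-5), three from an interferer (5).
estimates = [-5, -5, 5, 5, 5]
centers = list(range(-6, 7))
unit = weighted_tdoa_histogram(estimates, [1, 1, 1, 1, 1], centers, 0.5)
binary = weighted_tdoa_histogram(estimates, [1, 1, 0, 0, 0], centers, 0.5)
print(centers[unit.index(max(unit))], centers[binary.index(max(binary))])
```

With unit weights the interferer dominates the histogram, while a prior that zeroes out its blocks moves the peak back to the source of interest.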
The goal here is to weigh the TDOA estimates of
the different frequency blocks in an ‘optimal’ manner.
We will now attempt to find information about the
TDOA incrementing weights from the generalized cross-
correlation of Eq. (3). By noticing that the cross-
correlation is always real, the discrete-frequency version
of Eq. (3) can be stated as:
$$\hat{\tau} = \arg\max_{\beta} \sum_{\omega} W(\omega) |X_1(\omega)| |X_2(\omega)| \cos(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta) \tag{13}$$
where $\hat{\tau}$ is the cross-correlation estimate of the TDOA.
This cross-correlation can be viewed as the maximization
of a likelihood function for the TDOA, as follows:
$$\hat{\tau} = \arg\max_{\beta} \ell(\beta) \tag{14}$$
where,
$$\ell(\beta) = \sum_{\omega} W(\omega) |X_1(\omega)| |X_2(\omega)| \cos(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta) \tag{15}$$
Now, we notice that the likelihood function is composed
of a weighting function $\Psi(\omega)$ and a selector function
$\xi(\omega)$ as shown below:
$$\ell(\beta) = \sum_{\omega} \Psi(\omega) \xi(\omega) \tag{16}$$
where,
$$\Psi(\omega) = W(\omega) |X_1(\omega)| |X_2(\omega)| \tag{17}$$
$$\xi(\omega) = \cos(\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta) \tag{18}$$
Hence, the cross-correlation likelihood function is
composed of a cosine phase error selector function. The
phase error is a term that was initially proposed by [19],
and is defined as $\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta$. Basically,
when $\beta$ is close to the correct time-delay, the phase error
will be small, and when $\beta$ is far from the correct time-delay,
the phase error will be large. Consequently, the
selector function will act as a phase error punish–reward
technique. The generalized cross-correlation, with its
cosine selector function, is lenient on larger phase errors.
As shown in Fig. 8, a sharper selector function can lead
to a better TDOA estimate because of its greater attenuation
of larger phase errors.
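The likelihood of Eq. (15) can be evaluated directly from magnitudes and phases. The sketch below is a minimal illustration under assumed inputs: per-frequency magnitudes, phases in radians, angular frequencies in rad/s, and a hypothetical grid of candidate delays. The function name and test values are not from the paper.

```python
import math

def gcc_likelihood(mags1, mags2, angles1, angles2, omegas, beta, weights=None):
    """Discrete likelihood of Eq. (15) for one candidate delay beta."""
    if weights is None:
        weights = [1.0] * len(omegas)  # unfiltered cross-correlation, W = 1
    total = 0.0
    for w, m1, m2, a1, a2, omega in zip(weights, mags1, mags2, angles1, angles2, omegas):
        phase_error = a1 - a2 + omega * beta
        # Weighting term times the cosine selector of Eq. (18).
        total += w * m1 * m2 * math.cos(phase_error)
    return total

# Hypothetical noise-free example: X2 leads X1 by tau, so angle2 = angle1 + omega * tau.
tau = 2e-4
omegas = [2 * math.pi * f for f in (300.0, 500.0, 700.0, 900.0)]
angles1 = [0.1, 0.2, 0.3, 0.4]
angles2 = [a + w * tau for a, w in zip(angles1, omegas)]
mags = [1.0] * 4
candidates = [k * 1e-4 for k in range(-4, 5)]
best = max(candidates, key=lambda b: gcc_likelihood(mags, mags, angles1, angles2, omegas, b))
print(best)
```

At the correct delay every phase error vanishes and each term contributes its full weight. Away from it the cosine decays only gradually, which is the leniency the text refers to.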
The rectangular selector function of Fig. 8 is defined
as:
Fig. 7. SNR dependent conditional TFHs using prior knowledge regarding the formant structure (F1 and F2) of the sound signal of interest.
$$\xi(\omega) = \begin{cases} 1, & |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| \le \epsilon \\ 0, & |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| > \epsilon \end{cases} \tag{19}$$
where $\epsilon$ is a possibly frequency dependent width of the
rectangular selector function. This rectangular selector
function in fact makes the generalized cross-correlation
likelihood equivalent to a TFH (in the case that $\epsilon$ is
equal to a constant $\alpha$ times the frequency $\omega$). By using
the sharper selector function in place of Eq. (18), the likelihood of Eq. (16) becomes:
$$\ell(\beta) = \sum_{\forall \omega :\, |\angle X_1(\omega) - \angle X_2(\omega) + \omega\beta| \le \epsilon} W(\omega) |X_1(\omega)| |X_2(\omega)| \tag{20}$$
which is simply a weighted count of all time–frequency
blocks whose phase difference corresponds to a TDOA
equal to $\beta$. Here, each block is counted in the histogram
bin centered at $\beta$ as long as:
$$-\beta - \alpha < \frac{\angle X_1(\omega) - \angle X_2(\omega)}{\omega} < -\beta + \alpha \tag{21}$$
where $\alpha$ is half the width of each of the TDOA histogram
bins. Eq. (21) allows us to build a histogram
completely from the phase information (and amplitude
information in case of the UCC or the ML incrementing
functions defined below) without the need for any cross-correlation.
The above analysis illustrates that the TDOA likelihood
function derived from the cross-correlation is in
fact equivalent to a TFH with a weighted counter. This
counter, or histogram incrementing function for the $i$th
time–frequency block, is defined as:
$$c(i) = W(\omega) |X_{i,1}| |X_{i,2}| \tag{22}$$
This allows us to use the weighting functions derived for
the cross-correlation. Hence, we propose the following
three histogram incrementing functions:
$$c_{\mathrm{ML}}(i) = \frac{|X_{i,1}|^2 |X_{i,2}|^2}{|N_{i,1}|^2 |X_{i,2}|^2 + |X_{i,1}|^2 |N_{i,2}|^2} \tag{23}$$
$$c_{\mathrm{UCC}}(i) = |X_{i,1}| |X_{i,2}| \tag{24}$$
$$c_{\mathrm{PHAT}}(i) = 1 \tag{25}$$
The phase transform incrementing function is equivalent
to traditional histograms with unit incrementing functions.
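A conditional TFH built from Eqs. (21), (24) and (25) can be sketched as follows. This is a minimal illustration, not the authors' implementation: each block is represented by a hypothetical tuple of angular frequency and two complex spectral values, phase wrapping is ignored by assuming small delays and low frequencies, and the ML increment of Eq. (23) is left out because it needs noise estimates.

```python
import cmath
import math

def c_phat(x1, x2):
    """Unit increment of Eq. (25)."""
    return 1.0

def c_ucc(x1, x2):
    """Magnitude product increment of Eq. (24)."""
    return abs(x1) * abs(x2)

def tfh(blocks, centers, alpha, increment=c_ucc):
    """Histogram over delay bins; blocks are (omega, X1, X2) tuples."""
    hist = [0.0] * len(centers)
    for omega, x1, x2 in blocks:
        phase_diff = cmath.phase(x1 * x2.conjugate())  # angle(X1) - angle(X2)
        beta_i = -phase_diff / omega                   # delay implied by Eq. (21)
        for b, center in enumerate(centers):
            if abs(beta_i - center) < alpha:
                hist[b] += increment(x1, x2)
    return hist

def make_block(freq_hz, amp, tau):
    """Hypothetical noise-free block in which X2 leads X1 by tau seconds."""
    omega = 2 * math.pi * freq_hz
    x1 = amp * cmath.exp(0.3j)
    return (omega, x1, x1 * cmath.exp(1j * omega * tau))

# Three strong blocks from the source of interest, five weak blocks from an interferer.
blocks = [make_block(f, 2.0, 2e-4) for f in (300.0, 500.0, 700.0)]
blocks += [make_block(f, 0.5, -2e-4) for f in (400.0, 600.0, 800.0, 900.0, 1000.0)]
centers = [-2e-4, 0.0, 2e-4]
print(tfh(blocks, centers, 1e-4, c_phat))
print(tfh(blocks, centers, 1e-4, c_ucc))
```

In this made-up example the unit increment favours the more numerous weak blocks, whereas the magnitude product favours the fewer but stronger ones, which illustrates how the choice of incrementing function can change the selected peak.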
7. Histogram smoothing
Histograms are often noisy, due to the limited number
of blocks (be it time blocks, frequency blocks, or
time–frequency blocks) that are available. An example is
shown in Fig. 9, where the correct TDOA of −10 samples
corresponds to a sharp peak. However, there is a
secondary peak at −7 samples whose histogram value is
the global maximum. As a result, a simple peak selection
technique would mistakenly choose the −7 samples
TDOA over the correct −10 samples TDOA.
Fig. 8. A cosine phase error selector function, which is lenient on large
phase errors, and a sharp rectangular phase error selector function.
Fig. 9. A noisy histogram where the correct TDOA of −10 samples
cannot be obtained by taking the TDOA at the histogram maximum.
Fig. 10. A Gaussian smoothing function.
This situation can be resolved with a histogram filtering
or smoothing function. By convolving the noisy
histogram of Fig. 9 with a smoothing function, such as
the Gaussian one in Fig. 10, it is possible to estimate the
correct TDOA much more robustly. For example, the
convolution of Figs. 9 and 10, whose result is shown in
Fig. 11, now has a single dominant peak at −10 samples.
This histogram smoothing operation is performed in all
the results that are discussed in this paper. The main
reason for smoothing the histograms, versus using a
larger bin size, is that the histogram smoothing can result
in a higher accuracy than a wide-bin histogram.
Alternatively, an iterative histogram technique where
the bin sizes gradually decrease could also be used.
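The smoothing step can be sketched as a discrete convolution with a normalized Gaussian kernel. The histogram values, kernel width, and function names below are made up for illustration and are not the data of Figs. 9–11.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized Gaussian kernel with 2 * radius + 1 taps."""
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def smooth(hist, kernel):
    """Convolve hist with kernel, treating bins outside the histogram as zero."""
    r = len(kernel) // 2
    out = []
    for i in range(len(hist)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(hist):
                acc += w * hist[idx]
        out.append(acc)
    return out

# Hypothetical noisy histogram over lags -15..15: a cluster around -10 and a spike at -7.
lags = list(range(-15, 16))
hist = [0.0] * len(lags)
for lag, value in ((-11, 3.0), (-10, 4.0), (-9, 3.0), (-7, 5.0)):
    hist[lags.index(lag)] = value
smoothed = smooth(hist, gaussian_kernel(sigma=1.0, radius=3))
print(lags[hist.index(max(hist))], lags[smoothed.index(max(smoothed))])
```

In this example the raw maximum sits on the isolated spike, while the smoothed maximum moves to the centre of the cluster, which is the behaviour described for Figs. 9–11.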
8. Simulation results
In order to assess the performance of the proposed
TFHs, several simulations were performed. All of the
simulations used 100 speech segments, each 500 ms long, obtained
from a male speaker sampled at 44.1 kHz. For the
histogram-based techniques, the 500 ms segments were
sectioned into 100 sub-segments of 20 ms with 15 ms overlapping sections. For the standard cross-correlation
techniques, the generalized cross-correlation was per-
formed on the entire 500 ms (using 20 ms sub-segments
for the cross-power spectral density and the power
spectral density estimation). In all cases, before the
processing of each sub-segment, a Hanning window was
used as the windowing function.
In the simulations, a 15 cm inter-microphone distance was used and the main speaker was placed as shown in
Fig. 12. After the microphone signals were received,
independent white Gaussian noise was added to each
channel. For each speech segment and at each SNR, the
simulation was repeated 10 times in order to average the
randomness in the Gaussian noise signals.
The DOA root-mean-square error (RMSE) measure
in degrees was used to identify the relative error of each
Fig. 11. The result of the convolution of the noisy histogram of Fig. 9
with the histogram smoothing function of Fig. 10. Now, the correct
TDOA of −10 samples can be easily selected from the smoothed histogram.
Fig. 12. The setup for the first set of simulations.
Table 1
The root-mean-square DOA error (in degrees) for all of the techniques at SNRs from 0 to 20 dB
SNR    PHAT      ML        TH-UCC    TH-PHAT   TFH-PHAT  TFH-UCC   TFH-ML
0 dB   8.1497    2.5553    7.5568    11.8379   6.5897    4.4889    11.2817
2 dB   4.6862    2.0556    5.383     8.6119    4.6874    2.478     9.5758
4 dB   4.0377    1.4581    6.0438    9.7923    3.1758    2.4596    10.082
6 dB   3.0091    0.8297    5.3288    5.7857    4.2217    1.1882    8.5638
8 dB   2.5254    0.6013    0.9072    5.5829    2.7175    1.1878    9.2349
10 dB  1.6959    0.4731    0.8735    4.0009    2.6096    0.9382    4.8139
12 dB  1.4912    0.2719    0.8492    1.4289    1.2295    0.8484    6.3338
14 dB  1.3637    0.1997    3.7948    1.189     0.8217    0.8968    7.9343
16 dB  1.1646    0.1287    0.7685    1.1364    0.9488    0.7601    6.5276
18 dB  0.9072    0.0922    0.7783    1.038     0.9756    0.7925    4.0922
20 dB  0.8025    0.0582    0.703     0.9591    0.8257    0.7991    0.931
of the techniques at a given SNR. This error can be
obtained from the estimated time-delay $\tau$ as follows:
$$\Delta\theta = \arcsin\left(\frac{\tau v}{d}\right) - 30^{\circ} \tag{26}$$
where $v$ is the speed of sound in air (about 345 m/s), $d$ is
the inter-microphone distance (in this case 0.15 m), and
$\Delta\theta$ is the error in the DOA. By obtaining many different
DOA errors for the different 500 ms segments and for
different random noises that were added to the signals of
each channel, an overall root-mean-square DOA error
was obtained for each technique at each given SNR. The
results for all of the techniques are shown in Table 1.
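The error measure of Eq. (26) and the RMSE aggregation can be sketched as below, assuming the delay is given in seconds and the true DOA is 30° as in the equation. The clamping of the arcsine argument and the function names are assumptions added here for numerical safety and illustration.

```python
import math

def doa_error_deg(tau, v=345.0, d=0.15, true_doa_deg=30.0):
    """DOA error of Eq. (26) in degrees for a delay tau given in seconds."""
    # Clamp to [-1, 1] (an assumption) so noisy delays cannot break asin.
    ratio = max(-1.0, min(1.0, tau * v / d))
    return math.degrees(math.asin(ratio)) - true_doa_deg

def rmse(errors):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical delay estimates (seconds) scattered around the true delay for 30 degrees.
true_tau = 0.15 * math.sin(math.radians(30.0)) / 345.0
estimates = [true_tau, true_tau + 1e-5, true_tau - 1e-5]
print(rmse([doa_error_deg(t) for t in estimates]))
```

A delay equal to `true_tau` gives an error of essentially zero, and the RMSE grows with the spread of the estimates around it.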
Clearly, the most accurate TDOA estimation technique at all SNRs is the ML based generalized cross-correlation
technique followed by the TFH technique
with the UCC incrementing function (TFH-UCC),
which performs better than the other techniques (except
the ML based cross-correlation) at lower SNRs. The
best TFH technique (TFH-UCC) is compared to the
traditional cross-correlation techniques in Fig. 13.
As shown in Fig. 13, for low SNRs (<10 dB) the TFH-UCC technique has an RMSE that is half that of
the next best cross-correlation based TDOA estimation
technique. For higher SNRs, cross-correlation based
techniques have a smaller RMSE, partly because of the
discretization introduced by the finite-width of the his-
togram bins. In Fig. 13, the PHAT, ML, and UCC
techniques involve performing a single cross-correlation
with the corresponding weighting function on the entire 500 ms speech segment. The TFH-UCC technique
obtains time-delay estimates from the time–frequency
blocks of the large speech segment and then utilizes a
histogram in order to fuse the obtained results.
Fig. 14 compares the TFH-UCC technique with other
time-histograms. Here, the TH-UCC technique involves
Fig. 13. Comparison of traditional cross-correlation based TDOA
estimators along with the TFH-UCC histogram technique.
Fig. 14. Root-mean-square error of time-histogram techniques and the
TFH-UCC technique.
Fig. 15. Root-mean-square errors for three different time–frequency
histogram techniques.
Fig. 16. A setup with two speakers used for the simulations of Figs. 17–
19.
performing unfiltered cross-correlations on 20 ms speech
sub-segments and incorporating the results of all TDOA
estimates in a 500 ms speech segment using a histogram.
The TH-PHAT is identical except that the cross-correlation
uses a PHAT weighting. As shown, the TFH is
able to match the performance of the time-histograms at
high SNRs (1° RMSE for SNRs higher than 8 dB)
while providing a much lower RMSE at lower SNRs.
Finally, Fig. 15 illustrates the RMSE for all three
different TFH techniques. These techniques correspond
to TFHs that have incrementing functions defined by
Eqs. (23)–(25). As shown, the TFH-UCC has a lower
RMSE than the other TFH techniques. In fact, the TFH
with the ML weighting function results in a very large
RMSE. All the techniques achieve similar RMSEs at
larger SNRs.
Another simulation was performed but now, instead
of independent additive Gaussian noise signals, a second
male speaker was placed in the environment. The setup
of this new simulation is shown in Fig. 16.
The RMSE results corresponding to this second
simulation are shown in Table 2. As shown, the TFH-
UCC technique again performs better at sub-10 dB
SNRs, while the ML based cross-correlation performs
very well at SNRs of 10 dB or higher.
Fig. 17 compares the standard single-segment cross-correlation
techniques with the TFH-UCC technique.
As shown, for SNRs less than 10 dB the TFH-UCC
technique outperforms the other techniques. For larger
SNRs, the ML based generalized cross-correlation
is able to obtain a zero RMSE, outperforming all other
techniques.
Fig. 18 compares the TFH-UCC technique with the time-histogram techniques. As before, at lower SNRs
the TFH technique performs much better than its time-
histogram counterparts. At high SNRs (>16 dB), the
time-histogram technique with the PHAT-based seg-
ment cross-correlations performs much more accurately
than the other techniques.
Finally, we compare the three TFH techniques. As
shown in Fig. 19, the TFH-UCC performs better than all other TFH techniques at all SNRs greater than 0 dB.
This performance difference is especially significant in
the 2–16 dB range.
Table 2
The RMSE results corresponding to the simulation of Fig. 16
SNR    PHAT     ML       TH-UCC   TH-PHAT  TFH-PHAT  TFH-UCC  TFH-ML
0 dB   46.8181  45.3314  40.9209  39.4357  37.8257   40.5569  36.7308
2 dB   35.1136  35.1136  32.1155  37.317   36.7584   20.6388  35.7137
4 dB   23.4090  23.4090  31.1306  29.8062  32.1041   14.0993  30.9495
6 dB   20.2728  16.5527  25.2178  25.3922  28.1673   9.8807   26.8707
8 dB   16.5527  11.7045  21.6577  21.7975  25.2252   8.9426   26.7301
10 dB  16.5527  0.0000   21.8022  19.8171  25.2267   8.9067   23.9229
12 dB  11.7045  0.0000   17.6957  17.8486  21.8545   8.8853   23.6024
14 dB  11.7045  0.0000   19.7882  15.4871  17.8697   8.8739   19.9271
16 dB  0.0000   0.0000   17.7653  9.018    15.4561   8.8684   15.6338
18 dB  0.0000   0.0000   12.7298  0.7117   12.5879   8.8589   15.4253
20 dB  0.0000   0.0000   15.6437  0.6558   12.5872   8.855    12.5521
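The rankings discussed for this simulation can be checked directly against the entries of Table 2. The following is a minimal Python sketch and not part of the original simulations; the names TECHNIQUES, TABLE_2 and best_techniques are illustrative choices. It lists every technique that attains the lowest RMSE at a given SNR.

```python
# Values transcribed from Table 2 (second-speaker noise simulation of Fig. 16).
TECHNIQUES = ["PHAT", "ML", "TH-UCC", "TH-PHAT", "TFH-PHAT", "TFH-UCC", "TFH-ML"]

TABLE_2 = {  # SNR in dB -> RMSE for each technique, in the order of TECHNIQUES
    0: [46.8181, 45.3314, 40.9209, 39.4357, 37.8257, 40.5569, 36.7308],
    2: [35.1136, 35.1136, 32.1155, 37.317, 36.7584, 20.6388, 35.7137],
    4: [23.4090, 23.4090, 31.1306, 29.8062, 32.1041, 14.0993, 30.9495],
    6: [20.2728, 16.5527, 25.2178, 25.3922, 28.1673, 9.8807, 26.8707],
    8: [16.5527, 11.7045, 21.6577, 21.7975, 25.2252, 8.9426, 26.7301],
    10: [16.5527, 0.0000, 21.8022, 19.8171, 25.2267, 8.9067, 23.9229],
    12: [11.7045, 0.0000, 17.6957, 17.8486, 21.8545, 8.8853, 23.6024],
    14: [11.7045, 0.0000, 19.7882, 15.4871, 17.8697, 8.8739, 19.9271],
    16: [0.0000, 0.0000, 17.7653, 9.018, 15.4561, 8.8684, 15.6338],
    18: [0.0000, 0.0000, 12.7298, 0.7117, 12.5879, 8.8589, 15.4253],
    20: [0.0000, 0.0000, 15.6437, 0.6558, 12.5872, 8.855, 12.5521],
}


def best_techniques(snr_db):
    """Return every technique that attains the lowest RMSE at the given SNR."""
    row = TABLE_2[snr_db]
    lowest = min(row)
    return [name for name, rmse in zip(TECHNIQUES, row) if rmse == lowest]


if __name__ == "__main__":
    for snr_db in sorted(TABLE_2):
        print(f"{snr_db:2d} dB: {', '.join(best_techniques(snr_db))}")
```

With the tabulated values, TFH-UCC is the sole minimum from 2 to 8 dB, ML is the sole minimum from 10 to 14 dB, and PHAT and ML tie at zero from 16 dB upward. At 0 dB the lowest entry belongs to TFH-ML.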
Fig. 17. Comparison of standard cross-correlation techniques and the
TFH-UCC technique.
Fig. 18. Comparison of the TFH-UCC with the time-histogram
techniques.
120 P. Aarabi, S. Mavandadi / Information Fusion 4 (2003) 111–122
9. Conclusions
This paper proposed a weighted TFH technique for
the estimation of time-delays of arrival between two
microphones. This technique is less lenient on large errors
in the time-delay of each time–frequency block than
traditional cross-correlation based techniques, having a sharp
rectangular selector function as compared to the cosine
selector function of the cross-correlations. As a result,
for low SNRs with speech noise, the proposed time–frequency
technique is able to achieve a lower RMSE than other
cross-correlation or time-histogram techniques. For
Gaussian noise, and for speech noise at SNRs greater than
or equal to 10 dB, the ML-based generalized
cross-correlation technique performed better
than all other techniques.
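The contrast between the two selector functions can be illustrated numerically. The sketch below is a hedged illustration rather than the formulation used in this paper: it assumes that a block at angular frequency omega with delay estimate tau_block contributes cos(omega (tau - tau_block)) to a cross-correlation evaluated at a candidate delay tau, and contributes 1 to a histogram only when tau lies within half a bin width of tau_block. The function names and the example numbers are hypothetical.

```python
import math


def cosine_selector(tau, tau_block, omega):
    """Assumed cross-correlation-style contribution of one block at delay tau."""
    return math.cos(omega * (tau - tau_block))


def rectangular_selector(tau, tau_block, half_width):
    """Assumed histogram-style contribution: 1 inside the bin, 0 outside."""
    return 1.0 if abs(tau - tau_block) <= half_width else 0.0


# Hypothetical block: 500 Hz, delay estimate off by 0.2 ms from the candidate.
omega = 2.0 * math.pi * 500.0   # angular frequency (rad/s)
tau = 0.0                       # candidate delay (s)
tau_block = 0.0002              # erroneous block estimate (s)
half_width = 0.00005            # half of an assumed 0.1 ms histogram bin (s)

print(cosine_selector(tau, tau_block, omega))            # about 0.81: partial credit
print(rectangular_selector(tau, tau_block, half_width))  # 0.0: the block is rejected
```

Under these assumptions the erroneous block still adds most of its weight to the cross-correlation at the candidate delay, whereas the rectangular selector discards it entirely.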
The simulations showed that for sub-10 dB SNRs, the
TFH technique with the TFH-UCC incrementing
function of Eq. (24) had the best results when compared
to the other histogram techniques. It is interesting to
note that the TFH-PHAT technique, which consists of a
TFH with a unit incrementing function, performs worse
than the TFH-UCC technique at almost all SNRs. This
allows us to conclude that the equal treatment of all
time–frequency blocks adversely affects the outcome of
the time-delay estimation process.
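The effect of the incrementing function can be seen in a small example. The following Python sketch is an illustration under stated assumptions and is not the implementation used in the simulations: each block's delay is taken from the cross-phase of the two microphone coefficients (assuming no phase wrapping), the "ucc" option adds the magnitude product of the two coefficients to the histogram, and the "unit" option adds 1. The exact form of Eq. (24) is not reproduced here, and the names tfh_tdoa and make_block, as well as the test blocks, are hypothetical.

```python
import cmath
import math


def tfh_tdoa(blocks, bin_width, weighting="ucc"):
    """Estimate a TDOA from time-frequency blocks with a weighted histogram.

    blocks: iterable of (omega, x1, x2), where omega is in rad/s and x1, x2
            are the complex coefficients of the two microphones for one block.
    weighting: "ucc" adds |x1|*|x2| per block, "unit" adds 1 per block.
    """
    histogram = {}
    for omega, x1, x2 in blocks:
        # Block delay from the cross-phase; assumes |omega * tau| < pi.
        tau_block = cmath.phase(x1 * x2.conjugate()) / omega
        bin_index = round(tau_block / bin_width)
        weight = abs(x1) * abs(x2) if weighting == "ucc" else 1.0
        histogram[bin_index] = histogram.get(bin_index, 0.0) + weight
    # The centre of the histogram peak is taken as the TDOA estimate.
    return max(histogram, key=histogram.get) * bin_width


def make_block(freq_hz, tau, magnitude):
    """Hypothetical test block: microphone 2 is microphone 1 delayed by tau."""
    omega = 2.0 * math.pi * freq_hz
    x1 = complex(magnitude, 0.0)
    x2 = x1 * cmath.exp(-1j * omega * tau)
    return omega, x1, x2


# One strong block from the source (0.2 ms), two weak noise blocks (-0.5 ms).
blocks = [make_block(400.0, 0.0002, 10.0),
          make_block(300.0, -0.0005, 1.0),
          make_block(500.0, -0.0005, 1.0)]

print(tfh_tdoa(blocks, bin_width=0.0001, weighting="unit"))  # peak at the noise delay
print(tfh_tdoa(blocks, bin_width=0.0001, weighting="ucc"))   # peak at the source delay
```

In this toy case the unit incrementing function lets two weak blocks outvote one strong block, while the magnitude product weighting places the peak at the delay of the strong block.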
This paper did not analyze the effects of reverberations.
It is presumed that in reverberant environments,
extensions such as that proposed in [17] can be effective
at making the histograms robust. This issue was not
explored in this paper, but is an important problem that
requires further analysis in subsequent work. Also, it
remains unclear how the method of [18] would compare
to the TFH method proposed here. While [18] did not
utilize direct TFHs, it did utilize a model based on the
precedence effect and time–frequency smoothing. Hence,
a combination of the two techniques, using the model of
[18] and the time–frequency data fusion proposed here,
may result in more accurate TDOA estimates.
One remaining issue is the shape of the selector
function. It was shown here that the TFH technique and
the cross-correlation techniques are in fact equivalent
except that they have different selector functions.
However, the shape of this function requires further
analysis, perhaps resulting in an alternate shape that can
outperform both the TFH and the cross-correlation
techniques.
References
[1] P. Aarabi, Robust sound localization using the formant phase
transform, in: Proceedings of the 5th IEEE Workshop on
Nonlinear Signal and Image Processing (NSIP'01), Baltimore,
Maryland, June 2001.
[2] P. Aarabi, Multi-sense Artificial Awareness, M.A.Sc. Thesis,
Department of Electrical and Computer Engineering, University
of Toronto, June 1999.
[3] P. Aarabi, S. Zaky, Iterative spatial probability based sound
localization, in: Proceedings of the 4th World
Multi-conference on Circuits, Systems, Computers, and
Communications, Athens, Greece, July 2000.
[4] M.S. Brandstein, J. Adcock, H. Silverman, A practical time-delay
estimator for localizing speech sources with a microphone array,
Computer Speech and Language 9 (1995) 153–169.
[5] M.S. Brandstein, H. Silverman, A practical methodology for
speech source localization with microphone arrays, Computer
Speech and Language 11 (2) (1997) 91–126.
[6] M.S. Brandstein, H. Silverman, A robust method for speech signal
time-delay estimation in reverberant rooms, in: Proceedings of the
IEEE International Conference on Acoustics, Speech, and Signal
Processing, Munich, Germany, April 1997.
[7] K. Guentchev, J. Weng, Learning-based three dimensional sound
localization using a compact non-coplanar array of microphones,
in: Proceedings of the AAAI Symposium on Intelligent
Environments, 1998.
[8] C. Knapp, G. Carter, The generalized correlation method for
estimation of time delay, IEEE Transactions on Acoustics,
Speech, and Signal Processing ASSP-24 (4) (1976) 320–327.
[9] P. Aarabi, Robust speech localization using temporal power
fusion, in: Proceedings of Sensor Fusion: Architectures,
Algorithms, and Applications V (SPIE AeroSense'01), Orlando,
Florida, April 2001.
[10] J. Huang, N. Ohnishi, X. Guo, N. Sugie, Echo avoidance in a
computational model of the Precedence effect, Speech
Communication 27 (1999) 223–233.
[11] M.S. Datum, F. Palmieri, A. Moise, An artificial neural network
for sound localization using binaural cues, Journal of the
Acoustical Society of America 100 (1) (1996) 372–383.
[12] J. Backman, M. Karjalainen, Modelling of human directional and
spatial hearing using neural networks, in: Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal
Processing, Minneapolis, MN, April 1993.
[13] J. Huang, T. Supaongprapa, I. Terakura, F. Wang, N. Ohnishi, N.
Sugie, A model based sound localization system and its
application to robot navigation, Robotics and Autonomous Systems
27 (4) (1999) 199–209.
[14] R.O. Schmidt, Multiple emitter location and signal parameter
estimation, IEEE Transactions on Antennas and Propagation
AP-34 (3) (1986) 276–280.
Fig. 19. Comparison of the three different time–frequency histogram
techniques.
[15] H. Watanabe, M. Suzuki, N. Nagai, N. Miki, A method for
maximum likelihood bearing estimation without nonlinear
maximization, Transactions of the Institute of Electronics,
Information and Communication Engineers J72A (8) (1989)
303–308.
[16] P. Aarabi, K. Mohajer, Post recognition speech localization,
International Journal of Speech Technology, submitted.
[17] N. Strobel, R. Rabenstein, Classification of time delay estimates
for robust speaker localization, in: Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal
Processing, Phoenix, Arizona, March 1999.
[18] J. Huang, N. Ohnishi, N. Sugie, Sound localization in reverberant
environment based on the model of the Precedence effect, IEEE
Transactions on Instrumentation and Measurement 46 (4) (1997)
842–846.
[19] P. Aarabi, G. Shi, Multi-channel time–frequency data fusion, in:
Proceedings of the 5th International Conference on Information
Fusion, Annapolis, Maryland, July 2002.