Particle Filtering for Bearing-Only Audio-Visual Speaker Detection and Tracking

Andrew Rae, Member, IEEE, Alaa Khamis, Member, IEEE, Otman Basir, Member, IEEE, and Mohamed Kamel, Fellow, IEEE

A. Rae is formerly with the Pattern Analysis and Machine Intelligence (PAMI) research group, Department of Electrical and Computer Engineering, University of Waterloo, Canada. A. Khamis is a Research Assistant Professor with the Department of Electrical and Computer Engineering, University of Waterloo, Canada. O. Basir is Associate Director of the PAMI research group and an Associate Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. M. Kamel is Director of the PAMI research group and a Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Canada.
Abstract—We present a method for audio-visual speaker detection and tracking in a smart meeting room environment based on bearing measurements and particle filtering. Bearing measurements are determined using the Time Difference of Arrival (TDOA) of the acoustic signal reaching a pair of microphones, and by tracking facial regions in images from monocular cameras. A particle filter is used to sample the space of possible speaker locations within the meeting room, and to fuse the bearing measurements from auditory and visual sources. The proposed system was tested in a video messaging scenario, using a single participant seated in front of a screen to which a camera and microphone pair are attached. The experimental results show that the accuracy of speaker tracking using bearing measurements is related to the location of the speaker relative to the locations of the camera and microphones, which can be quantified using a parameter known as Dilution of Precision.
Index Terms—Direction of arrival estimation, acoustic arrays, machine vision, position measurement, Monte Carlo methods
I. INTRODUCTION
Speaker tracking has received attention in recent years
for its use in smart meeting room and classroom environments
to identify and track the active participant. This
information is applicable to distributed meetings where the
location of the speaker can be used to zoom in on the speaking
person, or to select the camera that provides the best view of
the speaker [1]. This allows remote meeting participants to
observe the facial expressions and gestures of the speaker,
thereby facilitating natural human interaction.
Speaker tracking is a relevant problem in the field of
multimodal target tracking, as it cannot be addressed using
the audio or visual sensing modality alone [2]. Audio-visual speaker
tracking takes advantage of the complementary nature of the
audio and visual modalities to track the location of and/or
identify a speaking person. For example, the location of the
speaker can be determined by TDOA or steered beamforming
methods, provided an array of microphones is available [3].
However, sound source localization is degraded in reverberant
environments, and is not applicable during the absence of
speech. Machine vision can be used to track people in the
environment (e.g., [1], [2]); however, it may not be possible to
determine decisively which person is speaking at a given time.
Fusing auditory and visual measurements should thus enable
the speaker to be located with greater reliability than either
modality alone.
We propose a method of audio-visual speaker tracking
using bearing measurements provided by audio and video
subsystems. By positioning cameras and microphone arrays at
separate locations throughout the meeting room environment,
bearing measurements determined using the data from each
sensor are used to estimate the two-dimensional (x–y) position
of the speaker in the room. Particle filtering is used to consider
multiple location hypotheses for the speaker throughout the
environment that are weighted by the bearing measurements
from the audio and visual sources.
The remainder of the paper is organized as follows. Section II
describes the bearing-only tracking method based on a particle
filtering framework. Section III describes an implementation
of the system in a video messaging application, and Section IV
presents preliminary experimental results. Section V gives
conclusions and future research directions.
II. BEARING-ONLY TRACKING METHOD
Our approach to audio-visual speaker tracking is to use
bearing measurements to locate and track the speaker in a
meeting room environment. The problem is formulated as a
two-dimensional tracking problem, where the x–y position of
the speaker in the horizontal plane is desired. Tracking the
vertical position of the speaker’s face would provide useful
information that would facilitate an automatic zoom onto
the speaker for more natural remote interaction; however, the
vertical position is ignored for simplicity at this time.
A. Notation
We define in this section the symbols and notation used
in the audio-visual speaker tracking approach. The horizontal
position of the speaker is expressed in a modified polar
coordinate system by the state vector $\mathbf{x}_t$, defined in (1).
$1/\rho_t$ is the reciprocal of the distance from the origin of the
coordinate system to the speaker, and $\theta_t$ is the angular position
of the speaker. A standard polar coordinate system would use the
distance $\rho_t$ rather than its reciprocal. The subscript $t$
denotes the time index.

$$\mathbf{x}_t = \begin{bmatrix} 1/\rho_t \\ \theta_t \end{bmatrix} \qquad (1)$$
The particle filter uses a set of $N$ samples $\{\mathbf{x}_t^i\}_{i=1}^N$ to
approximate the posterior probability distribution of the state
vector $\mathbf{x}_t$ conditioned on a sequence of measurements
$\mathbf{z}_{1:t} = \{\mathbf{z}_1, \ldots, \mathbf{z}_t\}$ [4], as shown in (2).
Each particle $i \in [1, N]$ has a weight $w_t^i$ that represents its
relative importance to the particle set.

$$p(\mathbf{x}_t \mid \mathbf{z}_{1:t}) \approx \sum_{i=1}^{N} w_t^i \, \delta(\mathbf{x}_t - \mathbf{x}_t^i) \qquad (2)$$
The measurement vector $\mathbf{z}_t$ contains the bearing measurements
from both the audio and video modalities at time instant $t$.
Let $M$ be the number of microphone pairs, and let $K$ be the
number of cameras. The locations of the microphones and cameras
are assumed known a priori. An estimate of the bearing to the
sound source is made for each microphone pair $m \in [1, M]$
based on the time difference of arrival of the sound signal at
each microphone in the pair. This bearing estimate is relative
to the midpoint between the two microphones in the pair [2].
We denote this bearing estimate as $\alpha_t^m$.

Bearing estimates to the various meeting participants are
made by tracking the participants visually in the images
provided by each camera. Each camera system $k \in [1, K]$
tracks multiple participants independently of the other camera
systems, providing $L_t^k$ bearing estimates, one for each tracked
participant. The set of bearing estimates from camera $k$ is
denoted $\Gamma_t^k = \{\gamma_{\ell,t}^k\}_{\ell=1}^{L_t^k}$.
The measurement vector $\mathbf{z}_t$ is defined in (3) as the set of
bearing measurements from all $M$ microphone pairs and all
$K$ camera systems.

$$\mathbf{z}_t = \begin{bmatrix} \alpha_t^1 \\ \vdots \\ \alpha_t^M \\ \Gamma_t^1 \\ \vdots \\ \Gamma_t^K \end{bmatrix} \qquad (3)$$
Fig. 1 shows an example meeting room configuration. One
pair of microphones ($M = 1$) provides a bearing measurement
$\alpha_t$ to the speaker using TDOA, while two camera systems
($K = 2$) give bearing estimates $\Gamma_t^1$ and $\Gamma_t^2$ to the tracked
participants.
B. Particle Filter Tracker
A particle filtering approach is used to track the location
of the speaking participant by fusing the bearing measurements
from the auditory and visual modalities. The following
discusses the proposed method, including models for the bearing
measurements and the system state dynamics, as well as the
use of boundary constraints imposed by the restricted meeting
room space.
A Sampling Importance Resampling (SIR) particle filter
is used. The importance density from which the particles
are sampled is chosen as the state transition probability
$p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, shown in (4). Due to this choice of importance
density, the weight of each particle $i \in [1, N]$ can
be updated recursively using the measurement probability
$p(\mathbf{z}_t \mid \mathbf{x}_t^i)$ [4], as shown in (5).

$$\mathbf{x}_t^i \sim p(\mathbf{x}_t \mid \mathbf{x}_{t-1}^i) \qquad (4)$$

$$w_t^i = w_{t-1}^i \, p(\mathbf{z}_t \mid \mathbf{x}_t^i) \qquad (5)$$

Fig. 1. An example meeting room configuration with one microphone pair and two camera systems. Bearing measurements for both auditory and visual modalities are shown.
Implementing this particle filter thus requires that the two
probability densities $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and $p(\mathbf{z}_t \mid \mathbf{x}_t)$ be defined. A
state transition model and a measurement model are used to
define these probabilities. In modeling the state transition, we
assume that the speaker position remains constant between
filter iterations and that any change in position is the result of
additive noise $\boldsymbol{\phi}_t$. The noise term $\boldsymbol{\phi}_t$ is assumed to have a
zero-mean Gaussian distribution with covariance matrix $\mathbf{Q}_t$,
and is defined in the modified polar coordinate system. The value
of each particle $i \in [1, N]$ is updated as in (6) with a sample
$\boldsymbol{\phi}_t^i$ drawn from the Gaussian distribution shown in (7). The
components of $\boldsymbol{\phi}_t$ are assumed to be independent, as indicated
by the covariance matrix $\mathbf{Q}_t$ being chosen as diagonal in (8).

$$\mathbf{x}_t^i = \mathbf{x}_{t-1}^i + \boldsymbol{\phi}_t^i \qquad (6)$$

$$\boldsymbol{\phi}_t^i \sim \mathcal{N}(\boldsymbol{\phi}_t; \mathbf{0}, \mathbf{Q}_t) \qquad (7)$$

$$\mathbf{Q}_t = \begin{bmatrix} \sigma_{1/\rho}^2 & 0 \\ 0 & \sigma_\theta^2 \end{bmatrix} \qquad (8)$$
As the meeting room is a known environment with known
dimensions, the particle set is constrained to lie within the
boundaries of the room. This is achieved by testing each
sample $\boldsymbol{\phi}_t^i$ drawn from the noise distribution in (7) to ensure
the new particle location $\mathbf{x}_t^i$ is within the room boundaries. If
$\mathbf{x}_t^i$ falls outside the room, a new sample $\boldsymbol{\phi}_t^i$ is drawn; this
process repeats until $\mathbf{x}_t^i$ lies within the room.
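To make the propagation and boundary test concrete, the following is a minimal NumPy sketch. It assumes the room spans $[0, \text{room\_w}] \times [0, \text{room\_h}]$ with the modified-polar origin at a corner, and takes a `numpy.random.Generator` (e.g., `numpy.random.default_rng()`); all names and room extents are illustrative, not from the paper.

```python
import numpy as np

def propagate_particles(particles, sigma_inv_rho, sigma_theta,
                        room_w, room_h, rng):
    """Random-walk propagation of (6)-(8) with boundary rejection.
    particles: (N, 2) array of states [1/rho, theta] in modified polar form.
    The room is assumed to span [0, room_w] x [0, room_h] with the
    coordinate origin at a corner (an illustrative choice)."""
    out = particles.copy()
    for i in range(len(out)):
        while True:
            phi = rng.normal(0.0, [sigma_inv_rho, sigma_theta])  # sample (7)
            cand = out[i] + phi                                  # update (6)
            if cand[0] <= 0.0:                                   # 1/rho must stay positive
                continue
            rho = 1.0 / cand[0]
            x, y = rho * np.cos(cand[1]), rho * np.sin(cand[1])  # to rectangular
            if 0.0 <= x <= room_w and 0.0 <= y <= room_h:        # inside the room?
                out[i] = cand
                break
    return out
```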
The bearing measurements $\alpha_t$ and $\Gamma_t$ from the audio and
video systems, respectively, are modeled as functions of the
state vector $\mathbf{x}_t$. The bearing measurements are relative to
the positions and orientations of the microphone pairs and
cameras. For each such sensor, a reference point and unit
vector are needed to define respectively the position of the
sensor and the direction of a measured angle of zero. This
reference point and unit vector are defined in the rectangular
coordinate system. The modeling of a bearing measurement
from each of these sensors to each particle is simplified by
also converting the particle set {xit}N i=1 to the rectangular
coordinate system. This conversion is necessary since the
origin of the polar coordinate system used to model the system
dynamics is in general different from the reference point of
the bearing measurement sensors.
For a sensor with reference point $\mathbf{p} = (p_x, p_y)$ and unit
vector $\mathbf{u} = (u_x, u_y)$, $h(\mathbf{x}_t^i, \mathbf{p}, \mathbf{u})$ is the model of the bearing
estimate to particle $\mathbf{x}_t^i$, and is given by (9).

$$h(\mathbf{x}_t^i, \mathbf{p}, \mathbf{u}) = \operatorname{atan2}\!\left(x_t^i - p_x,\; y_t^i - p_y\right) - \operatorname{atan2}(u_x, u_y) \qquad (9)$$
The bearing measurements provided by audio and video
have different probabilistic models, but use the same predicted
bearing given by (9). Bearing measurements from a microphone pair
are modeled as Gaussian in (10). This gives more emphasis
to particles located near the measured bearing vector. As a
camera system can provide bearing measurements to multiple
targets, a sum-of-Gaussians model is used, shown in (11).
Particles that are located near any of the measured bearing
vectors will be emphasized.

$$p(\alpha_t \mid \mathbf{x}_t^i) = \mathcal{N}\!\left(\alpha_t - h(\mathbf{x}_t^i, \mathbf{p}, \mathbf{u});\, 0,\, \sigma_\alpha^2\right) \qquad (10)$$

$$p(\Gamma_t \mid \mathbf{x}_t^i) = \sum_{\ell=1}^{L_t} \mathcal{N}\!\left(\gamma_{\ell,t} - h(\mathbf{x}_t^i, \mathbf{p}, \mathbf{u});\, 0,\, \sigma_\gamma^2\right) \qquad (11)$$
By assuming independence of the various bearing measurements,
the full measurement probability is given by the product
in (12). This expression provides the means of updating the
weight of each particle using (5).

$$p(\mathbf{z}_t \mid \mathbf{x}_t^i) = \prod_{m=1}^{M} p(\alpha_t^m \mid \mathbf{x}_t^i) \prod_{k=1}^{K} p(\Gamma_t^k \mid \mathbf{x}_t^i) \qquad (12)$$
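A compact sketch of this weighting step follows, assuming the particles have already been converted to rectangular coordinates (an $(N, 2)$ array `xy`) and that each measurement is bundled with its sensor's reference point `p` and unit vector `u`. The helper names and data layout are our own; the atan2 argument order follows the paper's (9).

```python
import numpy as np

def bearing_model(xy, p, u):
    """Predicted bearing h(x, p, u) of (9): angle from the sensor's reference
    point p to each particle, minus the sensor's zero-angle direction u."""
    return (np.arctan2(xy[:, 0] - p[0], xy[:, 1] - p[1])
            - np.arctan2(u[0], u[1]))

def gauss(err, sigma):
    """Gaussian density of the bearing error."""
    return np.exp(-0.5 * (err / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def update_weights(w, xy, audio_meas, cam_meas, sigma_a, sigma_g):
    """Weight update (5) with the product likelihood (12).
    audio_meas: list of (alpha, p, u) per microphone pair, model (10).
    cam_meas:   list of (gammas, p, u) per camera, sum-of-Gaussians (11);
    assumes each camera reports at least one candidate bearing."""
    for alpha, p, u in audio_meas:
        w = w * gauss(alpha - bearing_model(xy, p, u), sigma_a)
    for gammas, p, u in cam_meas:
        pred = bearing_model(xy, p, u)
        w = w * sum(gauss(g - pred, sigma_g) for g in gammas)
    return w / w.sum()  # renormalize before resampling
```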
Particles that lie near the intersection of two or more
bearing measurement vectors from two or more sensors will
receive high weight. These particles therefore have a high
probability of being resampled. The particle set will therefore
converge to a region where multiple bearing measurement
vectors intersect, which should correspond to the location of
the speaker. Figure 2 illustrates this convergence. Figure 2a
shows an initial particle set uniformly distributed over the
meeting room environment. After five iterations of the filter,
the particle set converges to the area shown in Figure 2b.
The particle set is now concentrated around the intersection of
the audio bearing measurement and one of the visual bearing
measurements, providing an estimate of speaker position using
only bearing measurements.
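The paper does not spell out its resampling scheme; below is a standard systematic resampler, one common choice for SIR filters, included to show how high-weight particles come to dominate the set between iterations.

```python
import numpy as np

def systematic_resample(particles, weights, rng):
    """Systematic resampling: draw N particles in proportion to their weights,
    then reset all weights to 1/N. One standard choice for SIR filters; the
    paper does not detail which scheme it uses."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n   # one draw per stratum
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0                                # guard against round-off
    idx = np.searchsorted(cumsum, positions)
    return particles[idx], np.full(n, 1.0 / n)
```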
III. A VIDEO MESSAGING APPLICATION
In this section we present an application of the proposed
tracking method in a reduced scope situation involving video
messaging. One microphone pair (M = 1) and one camera
(K = 1) are used to detect and track the speaker in an area
in front of a computer display. The microphones and camera
are attached to the display. This situation is common to video
messaging applications, such as Skype.

Fig. 2. Illustration of the convergence of the particle set to the intersection points of the bearing measurement vectors. (a) Initial particle set uniformly distributed over the meeting room. (b) Particle set after five filter iterations.
This section is organized as follows. First, the experimental
setup is described in Section III-A. Second, bearing measure-
ments from audio and video are discussed in Section III-B and
Section III-C, respectively.
A. Equipment Setup
We use two Logitech QuickCam Fusion web cameras with
integrated microphones to provide sensory data. These are
attached to a computer monitor. Audio signals from the two
microphones are used to determine the angular position of the
speaker relative to a reference point equidistant from the two
microphones. Images from the camera on the left are used to
measure the angular location of the meeting participants.
Images from the camera are acquired at 30 frames per
second with a resolution of 320 pixels × 240 pixels. The
microphone audio signals are sampled at 44.1 kHz. The Image
Acquisition Toolbox and Data Acquisition Toolbox in MATLAB
are used for this purpose. Acquisition from video and
audio sources occurs simultaneously. Data sets are collected
and processed offline in order to precisely synchronize the
audio channels. This is discussed in depth in the following
section.
B. Audio Bearing Measurements
Estimating the bearing to the speaker using audio data is
performed in three stages: synchronizing the audio channels,
measuring the Time Difference of Arrival (TDOA) of the
sound signal between channels, and finally calculating the
bearing estimate using this TDOA measurement.
Samples acquired from the two microphones at the same
time instant are not synchronized, as each audio channel
has a unique unknown time delay. To synchronize the audio
channels, a signal with a known frequency is used. At the
beginning of each data set, a signal with frequency 2 kHz
is output for 0.1 seconds by a computer speaker equidistant
from the two microphones. The acquired audio signals are time
shifted to synchronize with the beginning of this pulse. The
need to synchronize with an external signal is the principal
reason for performing testing offline using stored data.
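As a sketch of this synchronization step, one can locate the 2 kHz pulse in each channel by matched filtering against a reference tone and then align the channels at the detected onsets. The reference-tone approach and the defaults below are our reading of the described setup, not the authors' code.

```python
import numpy as np

def sync_offset(channel, fs=44100, f0=2000.0, dur=0.1):
    """Locate the 2 kHz synchronization pulse at the start of a recording by
    matched filtering against a reference tone; returns the sample index of
    the best alignment. A sketch only: a real system would bandpass-filter
    and threshold the detection before trusting it."""
    t = np.arange(int(fs * dur)) / fs
    ref = np.sin(2.0 * np.pi * f0 * t)                 # 0.1 s reference tone
    corr = np.correlate(channel, ref, mode='valid')    # matched filter
    return int(np.argmax(np.abs(corr)))

# Each channel is then shifted so its detected pulse onset becomes sample
# zero, aligning the two microphones before TDOA estimation.
```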
The next step is to measure the TDOA of speech signals
reaching the two microphones. The cross-correlation of the
two audio signals within a fixed time window is used to
estimate this time difference. The window size is set to
$T = 33.3$ ms, equal to the time between video frames.
Thus, we have bearing estimates from audio and video 30
times per second. The audio signal $s$ is recorded by both
microphones $m_1$ and $m_2$, with a time delay $D$ in $m_2$ compared
with $m_1$. The signals $s_1$ and $s_2$ from the two microphones
contain additive noise terms $n_1$ and $n_2$, shown in (13)–(14).
The value of this time delay is estimated from the peak in the
cross-correlation of $s_1$ and $s_2$ using (15); the cross-correlation
$\hat{R}_{12}(\tau)$ is estimated using (16) [5].

$$s_1(t) = s(t) + n_1(t) \qquad (13)$$

$$s_2(t) = s(t + D) + n_2(t) \qquad (14)$$

$$\hat{D} = \arg\max_{\tau} \hat{R}_{12}(\tau) \qquad (15)$$

$$\hat{R}_{12}(\tau) = \frac{1}{T} \int_{t_0 - T}^{t_0} s_1(t)\, s_2^{*}(t - \tau)\, dt \qquad (16)$$
Having estimated the time delay between the audio channels,
the bearing to the sound source (the speaker) can be calculated.
The bearing estimate is relative to the point $\mathbf{p}_m$ midway
between the two microphones, whose positions are given
by $\mathbf{m}_1$ and $\mathbf{m}_2$. The bearing angle $\alpha$ to the speaker is given
by (17), where $v$ is the speed of sound [2]. This expression
is an approximation valid when the separation between the
microphones is small compared with the distance $r$ to the
sound source, $r \gg |\mathbf{m}_1 - \mathbf{m}_2|$ [2].

$$\alpha = \arccos\!\left(\frac{D\,v}{|\mathbf{m}_1 - \mathbf{m}_2|}\right) \qquad (17)$$
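A minimal NumPy sketch of (15)–(17) follows. The microphone spacing and speed of sound are assumed values for illustration, and the sign convention of the lag depends on which channel is taken as the reference.

```python
import numpy as np

def tdoa_bearing(s1, s2, fs=44100, mic_sep=0.2, v=343.0):
    """Estimate the delay D from the cross-correlation peak, (15)-(16), then
    the bearing alpha from (17). mic_sep (meters) and the speed of sound v
    are illustrative values, not taken from the paper."""
    corr = np.correlate(s2, s1, mode='full')       # R_12 over all lags
    lag = int(np.argmax(corr)) - (len(s1) - 1)     # peak lag in samples
    d = lag / fs                                   # delay D in seconds
    ratio = np.clip(d * v / mic_sep, -1.0, 1.0)    # keep arccos in its domain
    return float(np.arccos(ratio))                 # bearing alpha, (17)
```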
C. Video Bearing Measurements
Bearing measurements from video are made in two stages:
first, detecting or tracking the locations of meeting participants
in the image, and second, converting each image
location into a bearing estimate relative to the camera location.
The heads of the participants are detected using skin-region
segmentation. This detection is performed once per second
(every 30 frames) to reinitialize the candidates. Between
detections, the candidate regions are tracked using the mean
shift method [6]. The horizontal image position of the center
of each candidate region is transformed into a bearing estimate
to that candidate in the world coordinate system relative to the
camera.
1) Face Candidate Detection: Skin region segmentation is
performed in the YCbCr color space. Skin color has been
shown to cluster more tightly in chrominance color spaces
such as the YCbCr color space, simplifying skin region
detection for a wide array of individuals and illumination
conditions [7]. A multivariate Gaussian model of skin color
is created in the Cb and Cr image channels, with mean $\boldsymbol{\mu}_s$
and covariance matrix $\boldsymbol{\Sigma}_s$. A binary skin image $B$ is created
by classifying each pixel $(i, j)$ as skin ('1') or non-skin ('0')
using the Mahalanobis distance criterion in (20). Fig. 3b shows
the result of this segmentation on an example image.

$$\boldsymbol{\mu}_s = \begin{bmatrix} \mu_{Cb} \\ \mu_{Cr} \end{bmatrix} \qquad (18)$$

$$\boldsymbol{\Sigma}_s = \begin{bmatrix} \sigma_{Cb}^2 & 0 \\ 0 & \sigma_{Cr}^2 \end{bmatrix} \qquad (19)$$

$$B(i,j) = \begin{cases} 1 & \text{if } D_M\!\left(\begin{bmatrix} Cb(i,j) \\ Cr(i,j) \end{bmatrix}, \boldsymbol{\mu}_s, \boldsymbol{\Sigma}_s\right) < 3 \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$
Morphological opening and closing operations are applied to
remove spurious pixels from the binary image and to fill in
gaps in detected regions, respectively.
The remaining connected components in the binary image
are tested against size and aspect ratio criteria to filter out
regions unlikely to be face regions. Components must have
a height and width of at least 20 pixels, and an aspect ratio
(ratio of component width to height) between 0.5 and 1. The
remaining components represent the face candidates, as shown
in Fig. 3c. An ellipse is defined for each candidate, with
major axis equal to the height of the component, minor axis
equal to the width of the component, and center at the
centroid of the component. The final candidate ellipse regions
are illustrated in Fig. 3d. The number of face candidates for
a given camera $k \in [1, K]$ is denoted $L_t^k$.
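The detection pipeline of (18)–(20) plus the morphological and shape tests can be sketched as follows, using SciPy's morphology routines. The structuring elements and connectivity are SciPy defaults, since the paper does not specify them.

```python
import numpy as np
from scipy import ndimage

def face_candidates(cb, cr, mu, var, thresh=3.0):
    """Skin segmentation per (18)-(20) plus the morphological and shape tests.
    cb, cr: chrominance channel images; mu, var: per-channel mean and variance
    of the skin model (diagonal covariance, as in (19))."""
    d2 = (cb - mu[0]) ** 2 / var[0] + (cr - mu[1]) ** 2 / var[1]
    skin = d2 < thresh ** 2                       # Mahalanobis test (20)
    skin = ndimage.binary_opening(skin)           # remove spurious pixels
    skin = ndimage.binary_closing(skin)           # fill gaps in regions
    labels, _ = ndimage.label(skin)               # connected components
    cands = []
    for sl in ndimage.find_objects(labels):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if h >= 20 and w >= 20 and 0.5 <= w / h <= 1.0:  # size/aspect criteria
            cands.append(sl)                      # keep as a face candidate
    return cands
```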
2) Mean Shift Tracking of Face Candidates: Face candidate
detection is performed using only one frame per second. For
the remaining 29 frames acquired during that second, the
detected face candidates are tracked using the mean shift
technique [6]. This technique uses a candidate region from the
previous frame as a model, and searches the new frame for a
region matching the model. In our implementation, the model
of face candidate $\ell \in [1, L_t^k]$ is the color histogram $\mathbf{q}_\ell^k$ of the
pixels within the elliptical region defined by the segmentation
method described in Section III-C1. These regions are also
illustrated in Fig. 3d. The model $\mathbf{q}_\ell^k$ is kept constant between
frames; it is not updated in each successive frame. Updating
the model can improve tracking robustness if the appearance
of the tracked object changes over time. However, it can also
allow the tracker to diverge to another object. In our system,
the participants being tracked will generally be looking at
the screen, and thus their facial appearance will not change
significantly. Keeping the model $\mathbf{q}_\ell^k$ constant ensures that the
image region most similar to the detected face region will be
tracked in each frame.

Fig. 3. Face region detection. (a) Original image. (b) Skin pixel segmentation result. (c) Skin regions after morphological filtering. (d) Elliptical face candidates overlaid on original image.
To accelerate the convergence of the mean shift tracker, a
crude motion model is used to predict the position of the face
region in the new frame. Denoting the center of elliptical
face candidate $\ell$ of camera $k$ at time $t$ by
$\mathbf{c}_{\ell,k,t} = (c_{\ell,k,t}^i, c_{\ell,k,t}^j)$,
the predicted location of this center point in the frame at time
$t + 1$ is given by (21). This equation assumes the center point
changes by the same amount as in the previous frame; this change
is given by (22).

$$\mathbf{c}_{\ell,k,t+1} = \mathbf{c}_{\ell,k,t} + \Delta\mathbf{c}_{\ell,k,t} \qquad (21)$$

$$\Delta\mathbf{c}_{\ell,k,t} = \mathbf{c}_{\ell,k,t} - \mathbf{c}_{\ell,k,t-1} \qquad (22)$$
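As a sketch, this constant-displacement prediction is just two lines; `c_t` and `c_prev` are our names for the candidate's center at the current and previous frame.

```python
def predict_center(c_t, c_prev):
    """Constant-displacement prediction of the candidate center, (21)-(22):
    the center is assumed to move by the same amount as between the two
    previous frames. c_t, c_prev: (row, col) centers at times t and t-1."""
    di, dj = c_t[0] - c_prev[0], c_t[1] - c_prev[1]   # Delta c, (22)
    return (c_t[0] + di, c_t[1] + dj)                 # predicted center, (21)
```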
3) Bearing Estimation: The image locations of the face
candidates are converted to bearing estimates relative to the
position of the camera in the world coordinate system. As we
are only interested in tracking the speaker in the horizontal
plane, only the horizontal image position $c_{\ell,k,t}^j$ is required.
The transformation of this image position into an angular
measurement is given by (23), where $W = 320$ pixels is the
width of the image, and $\beta = 78^{\circ}$ is the field of view of the camera.

$$\gamma_{\ell,t} = \arctan\!\left[\frac{1}{\left(1 - \dfrac{c_{\ell,k,t}^j}{0.5\,W}\right)\tan\dfrac{\beta}{2}}\right] \qquad (23)$$
This equation (23) maps the horizontal image position of
a face candidate to an angular measurement of bearing. It is
similar to an equation given in [2] used to map a bearing
estimate from TDOA to an image coordinate. The range of
bearing measurements possible using this method is limited
by the field of view of the camera, $\beta$. Table I gives the value
of the bearing measurement for certain key horizontal image
positions. Fig. 4a shows two face candidates, which result in
the bearing estimates $\gamma_t^1$ and $\gamma_t^2$ shown in Fig. 4b; also shown
is the field of view of the camera.

TABLE I
VALUE OF BEARING MEASUREMENT FOR CERTAIN KEY HORIZONTAL IMAGE POSITIONS.

Horizontal image position (pixels)   Bearing measurement (radians)
0                                    $\pi/2 + \beta/2$
$W/2$                                $\pi/2$
$W$                                  $\pi/2 - \beta/2$

Fig. 4. Visual bearing measurements. (a) Elliptical face candidates overlaid on original image. (b) Bearing measurements resulting from these face candidates.
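A sketch of the mapping in (23) follows, using atan2 so the result stays within the $[\pi/2 - \beta/2,\ \pi/2 + \beta/2]$ range listed in Table I. Because the original equation is only partially recoverable, note that depending on the direction assumed for the image x-axis, the sign of the denominator term may need to be flipped to match the table row-for-row.

```python
import numpy as np

def pixel_to_bearing(cj, width=320, fov_deg=78.0):
    """Map a horizontal image position cj (pixels) to a bearing, per our
    reading of (23). atan2 keeps the angle in [pi/2 - beta/2, pi/2 + beta/2],
    the range listed in Table I; flip the sign of u if the image x-axis
    convention is opposite to the one assumed here."""
    beta = np.radians(fov_deg)
    u = (1.0 - cj / (0.5 * width)) * np.tan(beta / 2.0)
    return float(np.arctan2(1.0, u))  # arctan(1/u), quadrant-safe
```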
IV. EXPERIMENTAL RESULTS
Results for audio-visual tracking of the active speaker in a
video messaging application are provided in this section. At
this time, experimental results are available only for the
single-participant case. This work will be extended to the
multiple-speaker case in the future.

Audio and video experimental data were collected while the
speaker was positioned at a set of known positions relative
to the screen upon which the camera and microphones are
mounted. The test positions can be separated into two groups.
Group 1 has a fixed lateral position of $x = 0$ m while the
distance from the monitor is varied from $y = 0.5$ m to
$y = 2.0$ m. Group 2 has a fixed distance of $y = 1.0$ m from the
monitor while the lateral position is varied from $x = -0.4$ m to
$x = 0.4$ m. While the speaker was sitting at each test position,
10 seconds of audio and video data were recorded.
The average estimated speaker location was determined for
each data set using offline processing. Knowing the ground
truth speaker position, the average localization error for each
data set was calculated.
The average errors for the positions in Group 1 are shown
in Fig. 5a. Shown in Fig. 5b is a parameter called “Dilution of
Precision (DOP),” calculated for each of the speaker locations.
This parameter gives an indication of how the geometry of the
sensors affects localization accuracy. DOP is used by Global
Positioning System (GPS) receivers to describe the effect
that satellite geometry has on the accuracy of the computed
receiver position [8]. A low DOP value indicates that satellite
geometry is favorable, while a high DOP value indicates that
geometry is poor.
Fig. 5. (a) Position estimation error (x and y) for Group 1 positions, where distance from the monitor was varied. (b) Dilution of precision for Group 1 positions.
To calculate DOP, we must first calculate the direction
cosine matrix G for the current sensor configuration as shown
in (24). DOP is calculated from G as shown in (25). Further
information on the calculation of DOP in the context of GPS
can be found in [8].
$$\mathbf{G} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ \cos\theta & \sin\theta \end{bmatrix} \qquad (24)$$

$$\mathrm{DOP} = \sqrt{\operatorname{tr}\!\left[(\mathbf{G}^T \mathbf{G})^{-1}\right]} \qquad (25)$$
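For the two-sensor case of (24), DOP can be computed directly from the bearing angles; a small sketch follows, with the caveat that nearly parallel bearings make $\mathbf{G}^T\mathbf{G}$ ill-conditioned, which is exactly the high-DOP regime the figures illustrate.

```python
import numpy as np

def dilution_of_precision(bearings):
    """DOP per (24)-(25): stack the direction cosines of each bearing into G,
    then take sqrt(tr((G^T G)^{-1})). For the paper's two-sensor case, pass
    the audio bearing and the camera bearing."""
    b = np.asarray(bearings, dtype=float)
    g = np.column_stack([np.cos(b), np.sin(b)])      # direction cosine matrix G
    return float(np.sqrt(np.trace(np.linalg.inv(g.T @ g))))
```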
It can be seen in Fig. 5b that DOP increases as the speaker
is located further away from the screen. Localization error also
increases, particularly in the y direction.
The average errors for the positions in Group 2 are shown in
Fig. 6a, and the DOP values are shown in Fig. 6b. It can be seen
that the minimum values of both localization error and DOP
occur when the speaker is at a lateral position of $x = 0$ m.
Error and DOP both tend to increase as the speaker moves
laterally in either the positive or negative x direction.
The apparent relationship between DOP and localization
error suggests that a better arrangement of sensors can be
found that improves localization accuracy.

Fig. 6. (a) Position estimation error (x and y) for Group 2 positions, where lateral position was varied. (b) Dilution of precision for Group 2 positions.
V. CONCLUSION
This paper has presented a novel approach to audio-visual
speaker tracking in a smart meeting room environment that
uses bearing measurements from audio and video to triangulate
the position of the speaker. Bearing measurements from audio
are calculated using the Time Difference of Arrival (TDOA) of
speech signals reaching each pair of microphones in a micro-
phone array. A machine vision algorithm detects participants’
faces using skin color segmentation and tracks the segmented
regions using mean shift kernel tracking. The tracked skin
regions are transformed into bearing measurements relative to
the location of the camera in the world coordinate system. A
particle filter uses the bearing measurements from audio and
video to weight the particles spread around the meeting room
environment. The particle set converges to the position in the
meeting room where the bearing measurements intersect, thus
triangulating the position of the speaker. Future extensions
to this work will test how well the proposed method tracks
multiple participants. Eventually, we would like to scale this
approach up to a meeting room environment using multiple
cameras and microphone pairs to detect and track multiple
meeting participants.
ACKNOWLEDGMENT
This work has been supported by the Ontario Research
Fund–Research Excellence (ORF-RE) program through the project
“MUltimodal SurvEillance System for SECurity-RElaTed
applications (MUSES SECRET),” funded by the Government of
Ontario, Canada.
REFERENCES
[1] D. Gatica-Perez, G. Lathoud, J.-M. Odobez, and I. McCowan, “Audio-visual Probabilistic Tracking of Multiple Speakers in Meetings,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 601–616, 2007.
[2] Y. Chen and Y. Rui, “Real-Time Speaker Tracking Using Particle Filter Sensor Fusion,” Proceedings of the IEEE, vol. 92, no. 3, pp. 485–494, 2004.
[3] D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle Filtering Algorithms for Tracking an Acoustic Source in a Reverberant Environment,” IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 826–836, 2003.
[4] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, 2002.
[5] C. H. Knapp and G. C. Carter, “The Generalized Correlation Method for Estimation of Time Delay,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 4, pp. 320–327, 1976.
[6] D. Comaniciu and V. Ramesh, “Mean Shift and Optimal Prediction for Efficient Object Tracking,” in Proc. 2000 Int. Conf. Image Processing, vol. 3, 2000, pp. 70–73.
[7] S. L. Phung, A. Bouzerdoum, and D. Chai, “A Novel Skin Color Model in YCbCr Color Space and its Application to Human Face Detection,” in Proc. 2002 Int. Conf. Image Processing, vol. 1, 2002, pp. 289–292.
[8] J. J. Spilker, “Satellite Constellation and Geometric Dilution of Precision,” in The Global Positioning System: Theory and Applications, B. W. Parkinson and J. J. Spilker, Eds. American Institute of Aeronautics and Astronautics, 1994, vol. I, pp. 177–208.