
Particle Filtering for Bearing-Only Audio-Visual Speaker Detection and Tracking

Andrew Rae, Member, IEEE, Alaa Khamis, Member, IEEE, Otman Basir, Member, IEEE, and Mohamed Kamel, Fellow, IEEE

A. Rae is formerly with the Pattern Analysis and Machine Intelligence (PAMI) research group, Department of Electrical and Computer Engineering, University of Waterloo, Canada. A. Khamis is a Research Assistant Professor with the Department of Electrical and Computer Engineering, University of Waterloo, Canada. O. Basir is Associate Director of the PAMI research group and an Associate Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. M. Kamel is Director of the PAMI research group and a Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Canada.

Abstract—We present a method for audio-visual speaker detection and tracking in a smart meeting room environment based on bearing measurements and particle filtering. Bearing measurements are determined using the Time Difference of Arrival (TDOA) of the acoustic signal reaching a pair of microphones, and by tracking facial regions in images from monocular cameras. A particle filter is used to sample the space of possible speaker locations within the meeting room, and to fuse the bearing measurements from auditory and visual sources. The proposed system was tested in a video messaging scenario, using a single participant seated in front of a screen to which a camera and microphone pair are attached. The experimental results show that the accuracy of speaker tracking using bearing measurements is related to the location of the speaker relative to the locations of the camera and microphones, which can be quantified using a parameter known as Dilution of Precision.

Index Terms—Direction of arrival estimation, acoustic arrays, machine vision, position measurement, Monte Carlo methods

I. INTRODUCTION

SPEAKER tracking has received attention in recent years for its use in smart meeting room and classroom environments to identify and track the active participant. This information is applicable to distributed meetings, where the location of the speaker can be used to zoom in on the speaking person, or to select the camera that provides the best view of the speaker [1]. This allows remote meeting participants to observe the facial expressions and gestures of the speaker, thereby facilitating natural human interaction.

Speaker tracking is a relevant problem to the field of multimodal target tracking, as it cannot be addressed using only audio or visual sensing modalities [2]. Audio-visual speaker tracking takes advantage of the complementary nature of the audio and visual modalities to track the location of and/or identify a speaking person. For example, the location of the speaker can be determined by TDOA or steered beamforming methods, provided an array of microphones is available [3]. However, sound source localization is degraded in reverberant environments, and is not applicable during the absence of speech. Machine vision can be used to track people in the environment (e.g., [1], [2]); however, it may not be possible to determine decisively which person is speaking at a given time. Fusing auditory and visual measurements should thus enable the speaker to be located with greater reliability than either modality alone.

We propose a method of audio-visual speaker tracking using bearing measurements provided by audio and video subsystems. By positioning cameras and microphone arrays at separate locations throughout the meeting room environment, bearing measurements determined from the data of each sensor are used to estimate the two-dimensional (x-y) position of the speaker in the room. Particle filtering is used to consider multiple location hypotheses for the speaker throughout the environment, which are weighted by the bearing measurements from the audio and visual sources.

The remainder of the paper is organized as follows. Section II describes the bearing-only tracking method based on a particle filtering framework. Section III describes an implementation of the system for a video messaging application. Section IV summarizes preliminary experimental results. Section V gives conclusions and future research directions.

II. BEARING-ONLY TRACKING METHOD

Our approach to audio-visual speaker tracking is to use bearing measurements to locate and track the speaker in a meeting room environment. The problem is formulated as a two-dimensional tracking problem, where the x-y position of the speaker in the horizontal plane is desired. Tracking the vertical position of the speaker's face would provide useful information to facilitate an automatic zoom onto the speaker for more natural remote interaction; however, the vertical position is ignored for simplicity at this time.

A. Notation

We define in this section the symbols and notation used in the audio-visual speaker tracking approach. The horizontal position of the speaker is expressed in a modified polar coordinate system by the state vector x_t, defined in (1). Here 1/ρ_t is the reciprocal of the distance from the origin of the coordinate system to the speaker, and θ_t is the angular position of the speaker. A polar coordinate system would use the distance ρ_t rather than its reciprocal. The subscript t represents the time index.

x_t = [1/ρ_t, θ_t]^T    (1)


The particle filter uses a set of N samples {x_t^i}_{i=1}^N to approximate the posterior probability distribution of the state vector x_t conditioned on a sequence of measurements z_{1:t} = {z_1, ..., z_t} [4], as shown in (2). Each particle i ∈ [1, N] has a weight w_t^i that represents its relative importance to the particle set.

p(x_t | z_{1:t}) = Σ_{i=1}^{N} w_t^i δ(x_t − x_t^i)    (2)

The measurement vector z_t contains the bearing measurements from both the audio and video modalities at time instant t. Let M be the number of microphone pairs, and let K be the number of cameras. The locations of the microphones and cameras are assumed known a priori. An estimate of the bearing to the sound source is made for each microphone pair m ∈ [1, M] based on the time difference of arrival of the sound signal at each microphone in the pair. This bearing estimate is relative to the midpoint between the two microphones in the pair [2]. We denote this bearing estimate as α_t^m.

Bearing estimates to the various meeting participants are made by tracking the participants visually in the images provided by each camera. Each camera system k ∈ [1, K] tracks multiple participants independently of the other camera systems, providing L_t^k bearing estimates, one for each tracked participant. The set of bearing estimates from camera k is denoted Γ_t^k = {γ_{k,t}^ℓ}_{ℓ=1}^{L_t^k}.

The measurement vector z_t is defined in (3) as the set of bearing measurements from all M microphone pairs and all K camera systems.

z_t = [α_t^1, …, α_t^M, Γ_t^1, …, Γ_t^K]^T    (3)

Fig. 1 shows an example meeting room configuration. One pair of microphones (M = 1) provides a bearing measurement to the speaker α_t using TDOA, while two camera systems (K = 2) give bearing estimates Γ_t^1 and Γ_t^2 to the tracked participants.

Fig. 1. An example meeting room configuration with one microphone pair and two camera systems. Bearing measurements for both auditory and visual modalities are shown.

B. Particle Filter Tracker

A particle filtering approach is used to track the location of the speaking participant by fusing the bearing measurements from the auditory and visual modalities. The following discusses the proposed method, including models for the bearing measurements and the system state dynamics, as well as the use of boundary constraints imposed by the restricted meeting room space.

A Sampling Importance Resampling (SIR) particle filter is used. The importance density from which the particles are sampled is chosen as the state transition probability p(x_t | x_{t−1}), shown in (4). Due to this choice of importance density, the weight of each particle i ∈ [1, N] can be updated recursively using the measurement probability p(z_t | x_t) [4], as shown in (5).

x_t^i ∼ p(x_t | x_{t−1}^i)    (4)

w_t^i = w_{t−1}^i p(z_t | x_t^i)    (5)

Implementing this particle filter thus requires that the two probability densities p(x_t | x_{t−1}) and p(z_t | x_t) be defined. A state transition model and a measurement model are used to define these probabilities. In modeling the state transition, we assume that the speaker position remains constant between filter iterations and that any change in position is the result of additive noise φ_t. The noise term φ_t is assumed to have a zero-mean Gaussian distribution with covariance matrix Q_t, and is defined in the modified polar coordinate system. The value of each particle i ∈ [1, N] is updated as in (6), with a sample φ_t^i drawn from this Gaussian distribution as shown in (7). The components of φ_t are assumed to be independent, as indicated by the covariance matrix Q_t being chosen as diagonal in (8).

x_t^i = x_{t−1}^i + φ_t^i    (6)

φ_t^i ∼ N(φ_t; 0, Q_t)    (7)

Q_t = diag(σ²_{1/ρ}, σ²_θ)    (8)

As the meeting room is a known environment with known dimensions, the particle set is constrained to lie within the boundaries of the room. This is achieved by testing each sample φ_t^i drawn from the noise distribution in (7) to ensure the new particle location x_t^i is within the room boundaries. If x_t^i is outside the room boundaries, a new sample φ_t^i is drawn. This process repeats until x_t^i is within the room boundaries.
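To make the propagation step concrete, the following Python sketch implements the random-walk model in (6)-(8) together with the room-boundary rejection test. It is a minimal illustration, not the authors' implementation: the particle array layout, the `to_xy` helper, the axis-aligned `room` bounds, and the `max_tries` safety cap are assumptions introduced here.

```python
import numpy as np

def to_xy(state):
    """Convert a modified-polar state [1/rho, theta] to Cartesian (x, y).

    rho is the distance from the coordinate-system origin to the speaker,
    so the stored component is its reciprocal (see Eq. (1))."""
    inv_rho, theta = state
    rho = 1.0 / inv_rho
    return rho * np.cos(theta), rho * np.sin(theta)

def propagate(particles, sigma_inv_rho, sigma_theta, room, rng, max_tries=100):
    """Random-walk propagation (Eqs. (6)-(8)) with rejection of samples
    that would leave the room.

    particles : (N, 2) array of [1/rho, theta] states
    room      : (xmin, xmax, ymin, ymax) bounds, a stand-in for the known
                meeting-room dimensions
    """
    xmin, xmax, ymin, ymax = room
    out = particles.copy()
    for i in range(len(particles)):
        for _ in range(max_tries):
            noise = rng.normal(0.0, [sigma_inv_rho, sigma_theta])  # phi_t^i ~ N(0, Q_t)
            cand = particles[i] + noise                            # Eq. (6)
            x, y = to_xy(cand)
            if xmin <= x <= xmax and ymin <= y <= ymax:            # keep only in-room samples
                out[i] = cand
                break
        # if no in-room sample is found within max_tries, the particle is left unmoved
    return out
```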

The bearing measurements α_t and Γ_t from the audio and video systems, respectively, are modeled as functions of the state vector x_t. The bearing measurements are relative to the positions and orientations of the microphone pairs and cameras. For each such sensor, a reference point and a unit vector are needed to define, respectively, the position of the sensor and the direction of a measured angle of zero. This reference point and unit vector are defined in the rectangular coordinate system. The modeling of a bearing measurement from each of these sensors to each particle is simplified by also converting the particle set {x_t^i}_{i=1}^N to the rectangular coordinate system. This conversion is necessary since the origin of the polar coordinate system used to model the system dynamics is in general different from the reference point of the bearing measurement sensors.

For a sensor with reference point p = (p_x, p_y) and unit vector u = (u_x, u_y), h(x_t^i, p, u) is the model of the bearing estimate to particle x_t^i, and is given by (9), where (x_t^i, y_t^i) are the rectangular coordinates of particle i.

h(x_t^i, p, u) = atan2(x_t^i − p_x, y_t^i − p_y) − atan2(u_x, u_y)    (9)

The bearing measurements provided by audio and video have different probabilistic models, but use the same prediction given by (9). Bearing measurements from a microphone pair are modeled as Gaussian in (10). This gives more emphasis to particles located near the measured bearing vector. As a camera system can provide bearing measurements to multiple targets, a sum-of-Gaussians model is used, shown in (11). Particles that are located near any of the measured bearing vectors will be emphasized.

p(α_t | x_t^i) = N(α_t − h(x_t^i, p, u); 0, σ²_α)    (10)

p(Γ_t | x_t^i) = Σ_{ℓ=1}^{L_t} N(γ_t^ℓ − h(x_t^i, p, u); 0, σ²_γ)    (11)

By assuming independence of the various bearing measurements, the full measurement probability is given by the product in (12). This expression provides the means of updating the weight of each particle using (5).

p(z_t | x_t^i) = Π_{m=1}^{M} p(α_t^m | x_t^i) · Π_{k=1}^{K} p(Γ_t^k | x_t^i)    (12)
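The measurement model in (9)-(12) and the weight update in (5) can be sketched as follows. Particles are assumed to have already been converted to rectangular coordinates as described above; the description of each sensor as a (reference point, unit vector) pair follows (9), while the function and parameter names (`sigma_alpha`, `sigma_gamma`, and so on) are illustrative. The wrapping of angle differences to (−π, π] is added here and is not stated in the paper.

```python
import numpy as np

def bearing_model(xy, p, u):
    """Predicted bearing from a sensor at p with reference direction u to a
    particle at Cartesian position xy (Eq. (9)).  The paper writes
    atan2(x - p_x, y - p_y); the wrapping to (-pi, pi] is added here."""
    ang = np.arctan2(xy[0] - p[0], xy[1] - p[1]) - np.arctan2(u[0], u[1])
    return (ang + np.pi) % (2 * np.pi) - np.pi

def gaussian(err, sigma):
    """Zero-mean Gaussian density evaluated at err."""
    return np.exp(-0.5 * (err / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def measurement_likelihood(xy, audio_bearings, audio_sensors, cam_bearing_sets,
                           cam_sensors, sigma_alpha, sigma_gamma):
    """p(z_t | x_t^i) as the product over microphone pairs and cameras (Eq. (12)).

    audio_bearings   : list of alpha_t^m, one per microphone pair
    audio_sensors    : list of (p, u) pairs, one per microphone pair
    cam_bearing_sets : list of Gamma_t^k (each a list of gammas), one per camera
    cam_sensors      : list of (p, u) pairs, one per camera
    """
    lik = 1.0
    for alpha, (p, u) in zip(audio_bearings, audio_sensors):        # Eq. (10)
        lik *= gaussian(alpha - bearing_model(xy, p, u), sigma_alpha)
    for gammas, (p, u) in zip(cam_bearing_sets, cam_sensors):       # Eq. (11)
        pred = bearing_model(xy, p, u)
        lik *= sum(gaussian(g - pred, sigma_gamma) for g in gammas)
    return lik

def update_weights(weights, particles_xy, *likelihood_args):
    """Weight update (Eq. (5)): w_t^i = w_{t-1}^i * p(z_t | x_t^i), then normalize.
    likelihood_args are the remaining arguments of measurement_likelihood."""
    w = np.array([w_prev * measurement_likelihood(xy, *likelihood_args)
                  for w_prev, xy in zip(weights, particles_xy)])
    return w / w.sum()
```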

Particles that lie near the intersection of two or more bearing measurement vectors from two or more sensors will receive high weight. These particles therefore have a high probability of being resampled. The particle set will therefore converge to a region where multiple bearing measurement vectors intersect, which should correspond to the location of the speaker. Figure 2 illustrates this convergence. Figure 2a shows an initial particle set uniformly distributed over the meeting room environment. After five iterations of the filter, the particle set converges to the area shown in Figure 2b. The particle set is now concentrated around the intersection of the audio bearing measurement and one of the visual bearing measurements, providing an estimate of speaker position using only bearing measurements.

Fig. 2. Illustration of the convergence of the particle set to the intersection points of the bearing measurement vectors.
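Putting the pieces together, one filter iteration could look like the sketch below, which reuses the `propagate`, `to_xy`, and `update_weights` helpers from the two previous sketches. The paper specifies an SIR filter but not the resampling scheme or the form of the point estimate; systematic resampling and a weighted-mean estimate are used here only as common choices, and `params` is a hypothetical container for the noise and room settings.

```python
import numpy as np

def systematic_resample(particles, weights, rng):
    """Resample particles in proportion to their weights (SIR resampling step).
    Systematic resampling is one common scheme; the paper does not say which
    scheme it uses."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[idx].copy(), np.full(N, 1.0 / N)

def filter_step(particles, weights, likelihood_args, params, rng):
    """One SIR iteration: propagate (Eq. (6)), weight (Eqs. (5), (12)), resample.
    likelihood_args matches the trailing arguments of measurement_likelihood."""
    particles = propagate(particles, params["sigma_inv_rho"], params["sigma_theta"],
                          params["room"], rng)
    xy = np.array([to_xy(s) for s in particles])
    weights = update_weights(weights, xy, *likelihood_args)
    estimate = weights @ xy                    # weighted-mean position estimate
    particles, weights = systematic_resample(particles, weights, rng)
    return particles, weights, estimate
```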

III. A VIDEO MESSAGING APPLICATION

In this section we present an application of the proposed tracking method in a reduced-scope situation involving video messaging. One microphone pair (M = 1) and one camera (K = 1) are used to detect and track the speaker in an area in front of a computer display. The microphones and camera are attached to the display. This situation is common to video messaging applications, such as Skype.

This section is organized as follows. First, the experimental setup is described in Section III-A. Second, bearing measurements from audio and video are discussed in Section III-B and Section III-C, respectively.

A. Equipment Setup

We use two Logitech QuickCam Fusion web cameras with integrated microphones to provide sensory data. These are attached to a computer monitor. Audio signals from the two microphones are used to determine the angular position of the speaker relative to a reference point equidistant from the two microphones. Images from the camera on the left are used to measure the angular location of the meeting participants.

Images from the camera are acquired at 30 frames per second with a resolution of 320 × 240 pixels. The microphone audio signals are sampled at 44.1 kHz. The Image Acquisition Toolbox and Data Acquisition Toolbox in MATLAB are used for this purpose. Acquisition from the video and audio sources occurs simultaneously. Data sets are collected and processed offline in order to precisely synchronize the audio channels. This is discussed in depth in the following section.

B. Audio Bearing Measurements

Estimating the bearing to the speaker using audio data is performed in three stages: synchronizing the audio channels, measuring the Time Difference of Arrival (TDOA) of the sound signal between channels, and finally calculating the bearing estimate using this TDOA measurement.

Samples acquired from the two microphones at the same time instant are not synchronized, as each audio channel has a unique unknown time delay. To synchronize the audio channels, a signal with a known frequency is used. At the beginning of each data set, a signal with frequency 2 kHz is output for 0.1 seconds by a computer speaker equidistant from the two microphones. The acquired audio signals are time shifted to synchronize with the beginning of this pulse. The need to synchronize with an external signal is the principal reason for performing testing offline using stored data.

The next step is to measure the TDOA of speech signals reaching the two microphones. The cross-correlation of the two audio signals within a fixed time window is used to estimate this time difference. The window size is set to T = 33.3 ms, which is equal to the time between video frames. Thus, we have bearing estimates from audio and video 30 times per second. The audio signal s is recorded by both microphones m_1 and m_2, with a time delay D in m_2 compared with m_1. The signals s_1 and s_2 from the two microphones contain additive noise terms n_1 and n_2, as shown in (13)-(14). The value of this time delay is estimated from the peak in the cross-correlation of s_1 and s_2 using (15); the cross-correlation R̂_12(τ) is estimated using (16) [5].

s_1(t) = s(t) + n_1(t)    (13)

s_2(t) = s(t + D) + n_2(t)    (14)

D = argmax_τ R̂_12(τ)    (15)

R̂_12(τ) = (1/T) ∫_{t_0−T}^{t_0} s_1(t) s_2*(t − τ) dt    (16)

Having estimated the time delay between the audio channels, the bearing to the sound source (the speaker) can be calculated. The bearing estimate is relative to the point p_m between the two microphones, whose positions are given by m_1 and m_2. The bearing angle α to the speaker is given by (17), where v is the speed of sound [2]. This expression is an approximation valid when the separation between the microphones is small compared with the distance r to the sound source, r ≫ |m_1 − m_2| [2].

α = arccos( D v / |m_1 − m_2| )    (17)
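The TDOA and bearing computation in (13)-(17) can be sketched as below, assuming the two channels have already been aligned with the 2 kHz synchronization pulse. The use of `numpy.correlate`, the mean removal, the assumed speed of sound of 343 m/s, and the clipping of the arccos argument are choices made here; the sign of the estimated lag depends on which channel is taken as the reference and may need to be flipped to match the paper's definition of D.

```python
import numpy as np

def tdoa_bearing(s1, s2, fs, mic1, mic2, v=343.0):
    """Bearing to the sound source from one analysis window (Eqs. (13)-(17)).

    s1, s2     : equal-length, synchronized sample windows from the two mics
    fs         : sampling rate in Hz (44.1 kHz in the paper's setup)
    mic1, mic2 : microphone positions as (x, y) coordinates
    v          : assumed speed of sound in m/s
    """
    s1 = np.asarray(s1, float) - np.mean(s1)
    s2 = np.asarray(s2, float) - np.mean(s2)
    # Cross-correlation over all lags (Eqs. (15)-(16)); the peak index gives the
    # delay in samples relative to the zero-lag position len(s2) - 1.
    corr = np.correlate(s1, s2, mode="full")
    lag = int(np.argmax(corr)) - (len(s2) - 1)
    D = lag / fs                                          # time delay in seconds
    baseline = np.linalg.norm(np.asarray(mic1, float) - np.asarray(mic2, float))
    # Far-field approximation (Eq. (17)); clipping keeps the argument in [-1, 1]
    # when noise makes |D * v| slightly exceed the microphone separation.
    return np.arccos(np.clip(D * v / baseline, -1.0, 1.0))

# With the paper's settings, a 33.3 ms window at 44.1 kHz is roughly 1470 samples.
```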

C. Video Bearing Measurements

Bearing measurements from video are made in two stages: first, detecting or tracking the location of meeting participants in the image, and second, converting this image location into a bearing estimate relative to the camera location. The heads of the participants are detected using skin-region segmentation. This detection is performed once per second (every 30 frames) to reinitialize the candidates. Between detections, the candidate regions are tracked using the mean shift method [6]. The horizontal image position of the center of each candidate region is transformed into a bearing estimate to that candidate in the world coordinate system relative to the camera.

1) Face Candidate Detection: Skin region segmentation is performed in the YCbCr color space. Skin color has been shown to cluster more tightly in chrominance color spaces such as YCbCr, simplifying skin region detection for a wide array of individuals and illumination conditions [7]. A multivariate Gaussian model of skin color is created in the Cb and Cr image channels with mean μ_s and covariance matrix Σ_s. A binary skin image B is created by classifying each pixel (i, j) as skin ('1') or non-skin ('0') using the Mahalanobis distance criterion in (20). Fig. 3b shows the result of this segmentation on an example image.

μ_s = [μ_Cb, μ_Cr]^T    (18)

Σ_s = diag(σ²_Cb, σ²_Cr)    (19)

B(i, j) = 1 if D_M([Cb(i, j), Cr(i, j)]^T, μ_s, Σ_s) < 3; 0 otherwise    (20)

A morphological opening operation followed by a closing operation are used to remove spurious pixels from the binary image and to fill in gaps in detected regions, respectively. The remaining connected components in the binary image are tested against size and aspect ratio criteria to filter out regions unlikely to be face regions. Components must have a height and width of at least 20 pixels, and an aspect ratio (ratio of component width to height) between 0.5 and 1. The remaining components represent the face candidates, as shown in Fig. 3c. An ellipse is defined for each candidate with major axis equal to the height of the component, minor axis equal to the width of the component, and centered at the centroid of the component. The final candidate ellipse regions are illustrated in Fig. 3d. The number of face candidates for a given camera k ∈ [1, K] is denoted L_t^k.

Fig. 3. Face region detection. (a) Original image. (b) Skin pixel segmentation result. (c) Skin regions after morphological filtering. (d) Elliptical face candidates overlaid on original image.
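The face-candidate detection step (Eqs. (18)-(20) plus the morphological filtering) can be sketched as below using SciPy. The BT.601 RGB-to-chrominance conversion, the SciPy default structuring elements, and the use of the bounding-box center as a stand-in for the component centroid are assumptions of this sketch; the paper's ellipse fitting is omitted.

```python
import numpy as np
from scipy import ndimage

def detect_face_candidates(rgb, mu_s, sigma_cb, sigma_cr,
                           min_size=20, ar_range=(0.5, 1.0)):
    """Skin-region face-candidate detection (Section III-C1).

    rgb                : H x W x 3 uint8 image
    mu_s               : (mean_Cb, mean_Cr) of the skin-color model, Eq. (18)
    sigma_cb, sigma_cr : standard deviations of the diagonal model, Eq. (19)
    Returns a list of (center_row, center_col, height, width) candidates.
    """
    r, g, b = [rgb[..., c].astype(float) for c in range(3)]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b      # BT.601 chrominance
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Squared Mahalanobis distance with a diagonal covariance (Eq. (20)).
    d2 = ((cb - mu_s[0]) / sigma_cb) ** 2 + ((cr - mu_s[1]) / sigma_cr) ** 2
    skin = d2 < 3.0 ** 2
    # Opening removes spurious pixels, closing fills gaps in detected regions.
    skin = ndimage.binary_opening(skin)
    skin = ndimage.binary_closing(skin)
    labels, _ = ndimage.label(skin)
    candidates = []
    for sl in ndimage.find_objects(labels):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        aspect = w / h                      # component width over height
        if h >= min_size and w >= min_size and ar_range[0] <= aspect <= ar_range[1]:
            # bounding-box center used as a stand-in for the component centroid
            candidates.append((sl[0].start + h / 2.0, sl[1].start + w / 2.0, h, w))
    return candidates
```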

2) Mean Shift Tracking of Face Candidates: Face candidate detection is performed using only one frame per second. For the remaining 29 frames acquired during that second, the detected face candidates are tracked using the mean shift technique [6]. This technique uses a candidate region from the previous frame as a model, and searches the new frame for a region matching the model. In our implementation, the model of face candidate ℓ ∈ [1, L_t^k] is the color histogram q_k^ℓ of the pixels within the elliptical region defined by the segmentation method described in Section III-C1. These regions are also illustrated in Fig. 3d. The model q_k^ℓ is kept constant between frames; it is not updated in each successive frame. Updating the model can improve tracking robustness if the appearance of the tracked object changes over time. However, it can also allow the tracker to diverge to another object. In our system, the participants being tracked will generally be looking at the screen, and thus their facial appearance will not change significantly. Keeping the model q_k^ℓ constant ensures that the image region most similar to the detected face region will be tracked in each frame.

To accelerate the convergence of the mean shift tracker, a crude motion model is used to predict the position of the face region in the new frame. Denoting the center of elliptical face candidate ℓ of camera k at time t by c_{k,t}^ℓ = (c_{k,t}^{ℓ,i}, c_{k,t}^{ℓ,j}), the predicted location of this center point in the frame at time t+1 is given by (21). This equation assumes the center point changes by the same amount as in the previous frame; this change is given by (22).

c_{k,t+1}^ℓ = c_{k,t}^ℓ + Δc_{k,t}^ℓ    (21)

Δc_{k,t}^ℓ = c_{k,t}^ℓ − c_{k,t−1}^ℓ    (22)
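The constant-displacement prediction in (21)-(22) amounts to a two-line helper; the result would be used to seed the mean shift search window in the next frame.

```python
def predict_center(c_t, c_tm1):
    """Predict a face-candidate center for frame t+1 (Eqs. (21)-(22)):
    c_{t+1} = c_t + (c_t - c_{t-1}) = 2*c_t - c_{t-1}.
    c_t, c_tm1 : (row, col) centers in the current and previous frames."""
    return (2 * c_t[0] - c_tm1[0], 2 * c_t[1] - c_tm1[1])
```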

3) Bearing Estimation: The image locations of the face candidates are converted to bearing estimates relative to the position of the camera in the world coordinate system. As we are only interested in tracking the speaker in the horizontal plane, only the horizontal image position c_{k,t}^{ℓ,j} is required. This image position is transformed into an angular measurement using (23), where W = 320 pixels is the width of the image, and β = 78° is the field of view of the camera.

γ_{k,t}^ℓ = arctan[ 1 / ( (1 − c_{k,t}^{ℓ,j}/(0.5 W)) tan(β/2) ) ]    (23)

Equation (23) maps the horizontal image position of a face candidate to an angular measurement of bearing. It is similar to an equation given in [2] used to map a bearing estimate from TDOA to an image coordinate. The range of bearing measurements possible using this method is limited by the field of view of the camera β. Table I gives the value of the bearing measurement for certain horizontal image positions. Fig. 4a shows two face candidates which result in the bearing estimates γ_t^1 and γ_t^2 shown in Fig. 4b; also shown is the field of view of the camera.

TABLE I
VALUE OF BEARING MEASUREMENT FOR CERTAIN KEY HORIZONTAL IMAGE POSITIONS

Horizontal image position (pixels)    Bearing measurement (radians)
0                                     π/2 + β/2
W/2                                   π/2
W                                     π/2 − β/2

Fig. 4. Visual bearing measurements. (a) Elliptical face candidates overlaid on original image. (b) Bearing measurements resulting from these face candidates.
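A sketch of the image-position-to-bearing mapping follows. The code is written to reproduce Table I (the center column maps to π/2 and the image edges to π/2 ± β/2); note that the sign inside the arctangent depends on the image-axis convention, so it may appear with the opposite sign to the form of (23) given above.

```python
import numpy as np

def pixel_to_bearing(c_j, W=320, beta=np.deg2rad(78.0)):
    """Map the horizontal image position of a face candidate to a bearing.

    c_j  : horizontal pixel position (0 .. W)
    W    : image width in pixels (320 in the paper's setup)
    beta : horizontal field of view in radians (78 degrees in the paper)
    """
    denom = (c_j / (0.5 * W) - 1.0) * np.tan(beta / 2.0)
    # atan2 keeps the bearing in (0, pi): pi/2 + beta/2 at c_j = 0,
    # pi/2 at c_j = W/2, and pi/2 - beta/2 at c_j = W, as in Table I.
    return np.arctan2(1.0, denom)
```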

IV. EXPERIMENTAL RESULTS

Results for audio-visual tracking of the active speaker in a

video messaging application are provided in this section. At

this time, experimental results are available only for the single

participant case. This work will be extended to the multiple

speaker case in the future.Audio and video experimental data was collected while the

speaker was positioned at a set of known positions relative

to the screen upon which the camera and microphones are

mounted. The test positions can be separated into two groups.

Group 1 has a fixed lateral position of  x = 0m while the

distance from the monitor is varied from y = 0.5m to y =2.0m. Group 2 is a fixed distance of  y = 1.0m from the

monitor while the lateral position is varied from x = −0.4m to

x = 0.4m. While the speaker was sitting at each test position,

10 seconds of audio and video data were recorded.

The average estimated speaker location was determined for

each data set using offline processing. Knowing the ground

truth speaker position for each data set, the average localiza-tion error for each data set was calculated.

The average errors for the positions in Group 1 are shown

in Fig. 5a. Shown in Fig. 5b is a parameter called “Dilution of 

Precision (DOP),” calculated for each of the speaker locations.

This parameter gives an indication of how the geometry of the

sensors affects localization accuracy. DOP is used by Global

Positioning System (GPS) receivers to describe the effect

that satellite geometry has on the accuracy of the computed

receiver position [8]. A low DOP value indicates that satellite

geometry is favorable, while a high DOP value indicates that

geometry is poor.


Fig. 5. (a) Position estimation error for Group 1 positions, where distance from the monitor was varied. (b) Dilution of precision for Group 1 positions.

To calculate DOP, we must first calculate the direction cosine matrix G for the current sensor configuration, as shown in (24). DOP is calculated from G as shown in (25). Further information on the calculation of DOP in the context of GPS can be found in [8].

G = [ cos α  sin α ; cos θ  sin θ ]    (24)

DOP = √( tr[ (Gᵀ G)⁻¹ ] )    (25)
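A sketch of the DOP computation in (24)-(25), generalized to any number of bearing sensors; each row of G is the direction cosine vector of one bearing. The example values at the end are illustrative only.

```python
import numpy as np

def dilution_of_precision(bearings):
    """DOP for a set of bearing measurements (Eqs. (24)-(25)).

    bearings : bearing angles in radians, one per sensor (e.g. the audio
               bearing and the visual bearing to the speaker)."""
    G = np.array([[np.cos(a), np.sin(a)] for a in bearings])   # Eq. (24)
    return float(np.sqrt(np.trace(np.linalg.inv(G.T @ G))))    # Eq. (25)

# Example (illustrative values): nearly parallel bearings give a large DOP,
# well-separated bearings a small one.
# dilution_of_precision(np.deg2rad([85.0, 95.0]))   # poor geometry
# dilution_of_precision(np.deg2rad([45.0, 135.0]))  # good geometry
```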

It can be seen in Fig. 5b that DOP increases as the speaker is located further away from the screen. Localization error also increases, particularly in the y direction.

The average errors for the positions in Group 2 are shown in Fig. 6a, and the DOP values are shown in Fig. 6b. It can be seen that the minimum values for both localization error and DOP occur when the speaker is at a lateral position of x = 0 m. Error and DOP both tend to increase as the speaker moves laterally in either the positive or negative x direction.

The apparent relationship between DOP and localization error suggests that a better arrangement of sensors can be found that improves localization accuracy.

Fig. 6. (a) Position estimation error for Group 2 positions, where lateral position was varied. (b) Dilution of precision for Group 2 positions.

V. CONCLUSION

This paper has provided a novel approach to audio-visual speaker tracking in a smart meeting room environment that uses bearing measurements from audio and video to triangulate the position of the speaker. Bearing measurements from audio are calculated using the Time Difference of Arrival (TDOA) of speech signals reaching each pair of microphones in a microphone array. A machine vision algorithm detects participants' faces using skin color segmentation and tracks the segmented regions using mean shift kernel tracking. The tracked skin regions are transformed into bearing measurements relative to the location of the camera in the world coordinate system. A particle filter uses the bearing measurements from audio and video to weight the particles spread around the meeting room environment. The particle set converges to the position in the meeting room where the bearing measurements intersect, thus triangulating the position of the speaker. Future extensions to this work will test how well the proposed method tracks multiple participants. Eventually, we would like to scale this approach up to a meeting room environment using multiple cameras and microphone pairs to detect and track multiple meeting participants.

ACKNOWLEDGMENT

This work has been supported by the Ontario Research Fund - Research Excellence (ORF-RE) program through the project "MUltimodal SurvEillance System for SECurity-RElaTed applications (MUSES SECRET)," funded by the Government of Ontario, Canada.

REFERENCES

[1] D. Gatica-Perez, G. Lathoud, J.-M. Odobez, and I. McCowan, "Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 601-616, 2007.
[2] Y. Chen and Y. Rui, "Real-Time Speaker Tracking Using Particle Filter Sensor Fusion," Proceedings of the IEEE, vol. 92, no. 3, pp. 485-494, 2004.
[3] D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle Filtering Algorithms for Tracking an Acoustic Source in a Reverberant Environment," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 826-836, 2003.
[4] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174-188, 2002.
[5] C. H. Knapp and G. C. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 4, pp. 320-327, 1976.
[6] D. Comaniciu and V. Ramesh, "Mean Shift and Optimal Prediction for Efficient Object Tracking," in Proc. 2000 Int. Conf. Image Processing, vol. 3, 2000, pp. 70-73.
[7] S. L. Phung, A. Bouzerdoum, and D. Chai, "A Novel Skin Color Model in YCbCr Color Space and its Application to Human Face Detection," in Proc. 2002 Int. Conf. Image Processing, vol. 1, 2002, pp. 289-292.
[8] J. J. Spilker, "Satellite Constellation and Geometric Dilution of Precision," in The Global Positioning System: Theory and Applications, B. W. Parkinson and J. J. Spilker, Eds. American Institute of Aeronautics and Astronautics, 1994, vol. I, pp. 177-208.
