
Computer Vision and Image Understanding 114 (2010) 641–651


Vision and RFID data fusion for tracking people in crowds by a mobile robot

T. Germa a,b,*, F. Lerasle a,b, N. Ouadah a,c, V. Cadenat a,b

a CNRS, LAAS, 7, Avenue du Colonel Roche, F-31077 Toulouse, France
b Université de Toulouse, UPS, INSA, INP, ISAE, LAAS-CNRS, F-31077 Toulouse, France
c CDTA/ENP, Cité 20 août 1956, Baba Hassen, Alger, Algeria

Article info

Article history: Received 13 January 2009; Accepted 5 January 2010; Available online 22 January 2010

Keywords: Radio frequency ID; Multimodal data fusion; Particle filtering; Person tracking; Person following; Multi-sensor fusion; Human visual servoing

doi:10.1016/j.cviu.2010.01.008

* Corresponding author. Address: CNRS, LAAS, 7, Avenue du Colonel Roche, F-31077 Toulouse, France.

Abstract

In this paper, we address the problem of realizing a human following task in a crowded environment. We consider an active perception system, consisting of a camera mounted on a pan-tilt unit and a 360° RFID detection system, both embedded on a mobile robot. To perform such a task, it is necessary to efficiently track humans in crowds. In a first step, we have dealt with this problem using the particle filtering framework because it enables the fusion of heterogeneous data, which improves the tracking robustness. In a second step, we have considered the problem of controlling the robot motion to make the robot follow the person of interest. To this aim, we have designed a multi-sensor-based control strategy based on the tracker outputs and on the RFID data. Finally, we have implemented the tracker and the control strategy on our robot. The obtained experimental results highlight the relevance of the developed perceptual functions. Possible extensions of this work are discussed at the end of the article.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Giving a mobile robot the ability to automatically follow a person appears to be a key issue to make it efficiently interact with humans. Numerous applications would benefit from such a capability. Service robotics is obviously one of these applications, as it requires interactive robots [16] able to follow a person to provide continual assistance in office buildings, museums, hospital environments, or even in shopping centers. Service robots clearly need to move in ways that are socially suitable for people. Such a robot has to localize its user, to discriminate him/her from other passers-by and to be able to follow him/her across complex human-centered environments. In this context, tracking a given person in crowds from a mobile platform appears to be fundamental. However, numerous difficulties arise: moving cameras with a limited view field, cluttered background, illumination variations, hard real-time constraints, and so on.

The literature offers many tools to go beyond these difficulties. Our paper focuses on the particle filtering framework as it easily enables the fusion of heterogeneous data from embedded sensors. Despite their sporadicity, such dedicated person detectors and their hardware counterparts are very discriminant when a detection occurs.


The paper is organized as follows. Section 2 gives an overview of the corresponding works done within our robotic context and introduces our contributions. Section 3 describes our omnidirectional RFID prototype. This sensor, which is very discriminant when a detection occurs, is used to detect the user wearing an RFID tag. Section 4 recalls some PF basics and details our new importance function for multimodal person tracking. The developed control strategy to achieve a person following task in a crowded environment is detailed in Section 5, while Section 6 presents the mobile robot which has been used for our tests and the obtained results. Finally, Section 7 summarizes our contributions and discusses future extensions.

2. Overview and related work

Particle filters (PF) [5], through different schemes, are currently investigated for person tracking in both the robotics and vision communities. Besides the well-known CONDENSATION scheme, the fairly seldom exploited ICONDENSATION [26] variant steers sampling towards state space regions of high likelihood by incorporating both the dynamics and the measurements in the importance function. PF represent the posterior distribution by a set of samples, or particles, with associated importance weights. This weighted particle set is first drawn from an importance function and the state vector's initial probability distribution, and is then updated over time taking into account the measurement models. Some approaches, e.g. [34], show that intermittent and discriminant cues based on person detection and recognition functionalities must be considered in the importance function in order to: (i) automatically re-initialize the tracker on the targeted person when failures occur and (ii) simplify the data association problem in populated settings [9].

Primarily, embedded detectors are generally restricted to stationary robots in order to (only) segment moving people from the background [40,46]. Some works [10,17,33] consider foreground segmentation based on disparity maps given a stereoscopic head [32], but they generally require significant CPU resources. Other techniques assume that people coarsely face the robot. In those cases, face detection [6,23,42] can be applied to successfully (re)-initialize the tracker after temporary occlusions, exits from the camera view field, or target losses. These multi-view face detectors have received increasing interest due to their computational efficiency. Such detectors have been extended to full or upper human body detection [2,39,47]. Some complementary approaches combine person detection and recognition [18,46] in order to distinguish the targeted person from others. Nevertheless, despite many advances, a major problem – sensitivity to pose and illumination – still exists, and a complete, reliable on-board visual solution that can be used in general conditions is not currently available. Clearly, using an on-board monocular system to sense humans is very challenging compared to static and deported systems. Thus, face detection and skin color detection are only available when the person faces towards the robot, and the robot can hardly follow behind or even walk next to the person. Full or upper human body detectors based on supervised learning are inappropriate to cover all the person ranges (from 0.5 to 5 m) and orientations¹ encountered when sensing from a mobile robot. Consequently, recent trends lead to methods based on multimodal sensor fusion. The idea is generally to use the video stream as the primary sensor and other sensor streams as secondary ones.

Beyond visible spectrum vision, thermal vision allows to overcome some of the aforementioned limitations, since humans have a distinctive thermal profile with respect to non-living objects. Moreover, their appearance does not depend anymore on lighting conditions. Yet, up to now, there are very few published works on using both thermal and visible cameras on mobile robots to detect/track humans (see a survey on thermal vision [22]). We can here mention the well-known PF proposed by Cielniak et al. in [12] which uses thermal vision for human detection and color images for capturing the appearance. Unfortunately, in crowds, sensing with thermal cameras leads to an abundance of additional hot spots. It is then impossible to identify a given person as all humans (and also all living objects...) stick out as white regions on a black background.

Some other multimodal systems devoted to person tracking utilize audio and visual sensors [8,7,33,34]. In crowds, the data association problem can be settled by speaker identification [28,45]. Nevertheless, sensing people with audio cues while the robot or the customers are moving is questionable. Indeed, the variability generated by the speaker, the recording conditions, the background noise especially in crowded environments, and the inherent intermittence of the voice stream (as humans do not babble all the time) are the main difficulties which have to be overcome. Therefore, speaker identification appears to be a challenging problem and still remains an open issue for further research.

Using laser range finders for person tracking is also frequent in the robotics community. In contrast to cameras, lasers provide accurate depth information, require little processing and are insensitive to ambient lighting. The classical strategy consists in extracting legs from a 2D laser scan at a fixed height. To this end, two particular types of features are intensively studied: motion [14,36] and geometry [4,6,29,39,47] features. Many multi-sensor fusion systems integrate the data provided by a laser range finder and a perspective [6,14,39] or omnidirectional [29,47] camera. Anyway, systems involving laser scans suffer from several drawbacks. Leg detection in a 2D scan does not provide robust features for discriminating the different persons in the robot vicinity, while the detector fails when one leg occludes the other.

1 The person can walk towards, away from, or past the robot, side-by-side, etc.

Recent person tracking approaches have focused on indoor positioning systems based on wireless networking indoor infrastructure and ultrasound, infrared [37], or radio frequency badges on humans' clothes [3,11,21,27,35]. Radio frequency (RF) signals are widely used as they: (i) can penetrate through most building materials, (ii) have an excellent range in indoor environments, and (iii) have less interference with other frequency components. Moreover, RFID tags are preferred to accelerometers for aesthetic and ergonomic reasons [24,38,43]. Common applications involving RFID technologies [3,11,27,31,37] assume stationary readers distributed throughout the settings, namely ubiquitous sensors. Only Schulz et al. [37] considered multimodal people tracking from a network of RF sensors and laser range finders placed throughout an environment. Our approach privileges on-board perceptual resources (monocular color vision and RF reader) in order to limit the hardware installation cost and therefore the indoor setting support. We can here mention the approach proposed in [21] which considers an on-board RF device for people detection. However, the detection range was limited to 180° and no multimodal data fusion was done.

Fig. 1. RF multiplexing prototype to address eight antennas.

Fig. 2. Occurrence frequencies of the angle θ_tag given one (a), two (b) or three (c) detections.


RFID sensors enjoy the nice property of providing explicit information about the person identity, even if the location information is relatively coarse. Our multimodal person tracker combines the accuracy and information richness advantages of active color vision with the identification certainty of RFID. This tracker, which has not been addressed in the literature, is expected to be more resilient to occlusions than vision-only systems, since it benefits from a coarse estimate of people location in addition to the knowledge of his/her appearance. Furthermore, the ID-sensor can act as a reliable stimulus that triggers the vision system. Finally, when several people lie in the camera view field,² this multimodal sensor data fusion will help in distinguishing the targeted person from the others.

The contribution of this paper is threefold. The first contribution is the customization of an off-the-shelf RFID system to make it able to detect tags in a 360° view field, by multiplexing eight antennas. We have embedded this system on our mobile robot Rackham to detect passive RFID-tagged persons. This omnidirectional ID-sensor, unaffected by lighting conditions or humans' appearance, appears as an ideal complement to trigger a PTU-mounted perspective camera. The second contribution concerns particle sampling within the ICONDENSATION scheme. We propose a genuine importance function based on probabilistic saliency maps issued from visual and RF person detection and identification, as well as a rejection sampling mechanism to (re)-position samples on the desired person during tracking. This particle sampling strategy, which is unique in the literature, should make our multi-sensor based tracker much more resilient to occlusions, data association errors, and target losses than vision-only systems. The last contribution concerns a multi-sensor-based control making the mobile robot reliably follow a person in real time in a more difficult setting than previous works [10,19,29].

3. Person detection and identification based on RFID

3.1. Device description

The device consists of: (i) a CAENRFID³ A941 multiprotocol off-the-shelf reader which works at 870 MHz, with a programmable emitting RF power from 100 to 1200 mW; (ii) eight directive antennas to detect the passive tags worn on the customer's clothes; (iii) a prototype circuit in order to sequentially initialize each antenna (Fig. 1). With a single antenna, only a tag angle relative to the antenna plane can be estimated. With our eight antennas, the tag can be detected all around the robot at any distance between 0.5 m (i.e. approximately the robot's radius) and 5 m. Given the placement of the antennas and their own fields of view, the robot neighborhood is divided into 24 areas (Fig. 3a), depending on the number of antennas simultaneously detecting the RFID tag.

2 In this case, there are multiple observations in the image plane.
3 See http://www.caen.it/rfid/.

To determine the observation model of the whole antenna set, statistics are performed by counting frequencies depending on the number of antennas (three at a maximum, Fig. 3a) that detect the tag. The resulting normalized histograms are shown in Fig. 2, where the x-axis represents the azimuthal angle θ_tag. Similar histograms can be observed for the distance d_tag.⁴ The resulting sensor model makes the simplifying assumption that both azimuth and distance histograms can be approximated by Gaussians respectively defined by (μ_θtag, σ_θtag) and (μ_dtag, σ_dtag), where μ(·) and σ(·) are the mean and standard deviation. Afterwards, we project these probabilities for the current tag position onto a saliency map of the floor. The size of the saliency map is 300 × 300 pixels; thus the area of each pixel represents 7 cm². Each pixel probability is calculated given the 8-antenna set outputs to approximate the RFID tag position (Fig. 3). The three rightmost plots in Fig. 3 respectively show the saliency maps for the detection by one, two or three antennas. Given this observation model, evaluations allow to characterize the ID-sensor performances.
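To make this sensor model concrete, the following minimal sketch (not the authors' code) shows how such a Gaussian azimuth/distance model could be projected onto a floor saliency map. The grid size and cell area echo the values quoted above; the function name, the angle-wrapping detail and the independence of the two Gaussians are illustrative assumptions.

```python
import numpy as np

def rfid_saliency_map(mu_theta, sigma_theta, mu_d, sigma_d,
                      size=300, cell=0.027):
    """Project the Gaussian azimuth/distance tag model onto a floor grid.

    A sketch under stated assumptions: the map is a size x size grid centred
    on the robot; each cell covers roughly cell x cell metres (~7 cm^2,
    as in the paper); mu/sigma come from the antenna-set histograms.
    """
    half = size // 2
    xs = (np.arange(size) - half) * cell              # metres, robot frame
    X, Y = np.meshgrid(xs, xs)
    theta = np.arctan2(Y, X)                          # azimuth of each cell
    d = np.hypot(X, Y)                                # range of each cell
    # Wrap the angular difference into (-pi, pi] before scoring it
    dtheta = np.arctan2(np.sin(theta - mu_theta), np.cos(theta - mu_theta))
    # Independent Gaussians on azimuth and distance (simplifying assumption)
    p_theta = np.exp(-0.5 * (dtheta / sigma_theta) ** 2)
    p_d = np.exp(-0.5 * ((d - mu_d) / sigma_d) ** 2)
    sal = p_theta * p_d
    return sal / sal.sum()                            # normalised saliency map
```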

3.2. Evaluations from feasibility study

The RF system has been mounted on our mobile robot Rackham (Section 6) and evaluated in the presence of people. We have proceeded in the following way. We have generated statistics by counting frequencies over an 81 m² area around the robot. Obstacles have been added one by one during the test runs. Their positions have been randomly chosen and uniformly distributed in this area. The corresponding ground-truth is based on the ratio between the occluding zones induced by obstacles (assuming an average person-width of 40 cm) and the total area.

Given such various "crowdedness" situations, the RFID tag has been moved around the robot, assuming no self-occlusion by the person wearing the tag during this evaluation. We have repeated this sequence for different distances and we have counted, for every point in a discrete grid, whether the tag worn by a fixed person is detected or not, depending on the crowdedness. Comparisons between experimental and theoretical detection rates are shown in Fig. 4 (see the box-and-whisker plots).

The x-axis and y-axis respectively denote the number of occluding persons (that is, "crowdedness") and the detection rate. The box plots and the thick stretches inside indicate the degree of dispersion (for 50% of the trials) and the median of the trials. Our experimental curves are shown to be rather close to the theoretical ones. As the system is disturbed by the occlusions, the number of false-negative readings logically increases with the number of obstacles. Nevertheless, the detection rate remains satisfactory, even for overcrowded scenes (e.g. 70% on average for seven persons standing around the robot). Furthermore, very few false-positive readings (reflections, detections with the wrong antennas...) are observed in practice.⁵

4 They are not presented here to save space, but they are available on request.

Fig. 3. Azimuthal view field of eight antennas (a) and saliency map for tag detection respectively for 1 (b), 2 (c) and 3 (d) antennas.

Fig. 4. Detection rate versus crowdedness in the robot surrounding.



4. Person tracking using vision and RFID

4.1. Basics on particle filters and data fusion

Particle filters (PF) aim at recursively approximating the posterior probability density function (pdf) p(x_k|z_{1:k}) of the state vector x_k at time k conditioned on the set of measurements z_{1:k} = z_1, ..., z_k. A linear point-mass combination

p(x_k | z_{1:k}) ≈ Σ_{i=1}^{N} w_k^{(i)} δ(x_k − x_k^{(i)}),   with   Σ_{i=1}^{N} w_k^{(i)} = 1,    (1)

is determined, where δ(·) is the Dirac distribution. It expresses the selection of a value – or "particle" – x_k^{(i)} with probability – or "weight" – w_k^{(i)}, i = 1, ..., N. An approximation of the conditional expectation of any function of x_k, such as the MMSE⁶ estimate E_{p(x_k|z_{1:k})}[x_k], then follows.

5 Passive tags induce few signal reflections contrary to their active counterparts.
6 For "Minimum Mean Square Error".

Algorithm 1. Generic particle filtering algorithm (SIR)

Require: [{x_{k-1}^{(i)}, w_{k-1}^{(i)}}]_{i=1}^{N}, z_k
1: if k = 0 then
2:   Draw x_0^{(1)}, ..., x_0^{(i)}, ..., x_0^{(N)} i.i.d. according to p(x_0), and set w_0^{(i)} = 1/N
3: end if
4: if k ≥ 1 then [— [{x_{k-1}^{(i)}, w_{k-1}^{(i)}}]_{i=1}^{N} being a particle description of p(x_{k-1}|z_{1:k-1}) —]
5:   for i = 1, ..., N do
6:     "Propagate" the particle x_{k-1}^{(i)} by independently sampling x_k^{(i)} ~ q(x_k | x_{k-1}^{(i)}, z_k)
7:     Update the weight w_k^{(i)} associated to x_k^{(i)} according to
         w_k^{(i)} ∝ w_{k-1}^{(i)} p(z_k | x_k^{(i)}) p(x_k^{(i)} | x_{k-1}^{(i)}) / q(x_k^{(i)} | x_{k-1}^{(i)}, z_k),
8:     prior to a normalization step so that Σ_i w_k^{(i)} = 1
9:   end for
10:  Compute the conditional mean of any function of x_k, e.g. the MMSE estimate E_{p(x_k|z_{1:k})}[x_k], from the approximation Σ_{i=1}^{N} w_k^{(i)} δ(x_k − x_k^{(i)}) of the posterior p(x_k|z_{1:k})
11:  At any time or depending on an "efficiency" criterion, resample the description [{x_k^{(i)}, w_k^{(i)}}]_{i=1}^{N} of p(x_k|z_{1:k}) into the equivalent evenly weighted particle set [{x_k^{(s(i))}, 1/N}]_{i=1}^{N}, by sampling in {1, ..., N} the indexes s(1), ..., s(N) according to P(s(i) = j) = w_k^{(j)}; set x_k^{(i)} and w_k^{(i)} to x_k^{(s(i))} and 1/N
12: end if

The "Sampling Importance Resampling" (SIR) algorithm, shown in Algorithm 1, is fully described by the prior p(x_0), the dynamics pdf p(x_k|x_{k-1}) and the observation pdf p(z_k|x_k). After initialization of an independent identically distributed (i.i.d.) sequence drawn from p(x_0), the particles stochastically evolve, being sampled from an importance function q(x_k | x_{k-1}^{(i)}, z_k). They are then suitably weighted to guarantee the consistency of the approximation (1). To this end, step 7 assigns each particle x_k^{(i)} a weight w_k^{(i)} involving its likelihood p(z_k | x_k^{(i)}) w.r.t. the measurement z_k as well as the values of the dynamics pdf and importance function at x_k^{(i)}. In order to limit the well-known degeneracy phenomenon [5], step 11 inserts a resampling stage introduced by Gordon et al. [20] so that the particles associated with high weights are duplicated while the others collapse, and the resulting sequence x_k^{(s(1))}, ..., x_k^{(s(N))} is i.i.d. according to (1).
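As an illustration of Algorithm 1, a compact generic SIR step can be written as follows. This is only a sketch of the scheme described above, not the authors' implementation: the callables propose, trans_pdf, lik and proposal_pdf stand for the problem-specific densities, and the effective-sample-size test used here as the "efficiency" criterion is one common choice.

```python
import numpy as np

def sir_step(particles, weights, propose, trans_pdf, lik, proposal_pdf, z, rng):
    """One SIR update (cf. Algorithm 1).

    particles: (N, dim) array, weights: (N,) array summing to 1.
    propose(x_prev, z)         -> new particle drawn from q(x_k | x_{k-1}, z_k)
    trans_pdf(x, x_prev)       -> p(x_k | x_{k-1})
    lik(z, x)                  -> p(z_k | x_k)
    proposal_pdf(x, x_prev, z) -> q(x_k | x_{k-1}, z_k)
    """
    N = len(particles)
    new_p = np.array([propose(particles[i], z) for i in range(N)])  # step 6
    # Step 7: importance weights
    w = np.array([weights[i] * lik(z, new_p[i]) * trans_pdf(new_p[i], particles[i])
                  / proposal_pdf(new_p[i], particles[i], z) for i in range(N)])
    w /= w.sum()                                   # step 8: normalisation
    est = (w[:, None] * new_p).sum(axis=0)         # step 10: MMSE estimate
    # Step 11: resample when the effective sample size drops (one possible criterion)
    if 1.0 / np.sum(w ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=w)
        new_p, w = new_p[idx], np.full(N, 1.0 / N)
    return new_p, w, est
```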


The CONDENSATION – for "Conditional Density Propagation" [25] – is the instance of the SIR algorithm such that the particles are drawn according to the system dynamics, viz. when q(x_k | x_{k-1}^{(i)}, z_k) = p(x_k | x_{k-1}^{(i)}). In visual tracking, the original algorithm [25] defines the particle likelihoods from contour primitives, but other visual cues have also been exploited [34]. On this point, resampling may lead to a loss of diversity in the state space exploration. The importance function must thus be carefully defined. As CONDENSATION draws the particles x_k^{(i)} from the system dynamics but "blindly" w.r.t. the measurement z_k, many of them may be assigned a low likelihood p(z_k | x_k^{(i)}) and thus a low weight in step 7, which significantly worsens the overall filter performance.
An alternative, henceforth labeled "Measurement-based SIR" (MSIR), merely consists in sampling the particles – or just some of their entries – at time k according to an importance function π(x_k|z_k) defined from the current image. The first MSIR strategy was ICONDENSATION [26], which guided the state space exploration by a color blob detector. Other visual detection functionalities can be used as well, e.g. face detection/recognition (see below), or any other intermittent primitive which, despite its sporadicity, is very discriminant when present [34]. Thus, the classical importance function π(·) based on a single detector can be extended to consider the outputs from L detection modules, i.e.

π(x_k^{(i)} | z_k^1, ..., z_k^L) = Σ_{l=1}^{L} κ_l π(x_k^{(i)} | z_k^l),   with   Σ_l κ_l = 1.    (2)

In an MSIR scheme, if a particle x_k^{(i)} drawn exclusively from the image (namely from π(·)) is inconsistent with its predecessor x_{k-1}^{(i)} from the point of view of the state dynamics, the update formula leads to a small weight w_k^{(i)}. One solution to this problem, as proposed in the genuine ICONDENSATION algorithm, consists in also sampling some particles from the dynamics and some w.r.t. the prior, so that:

q(x_k^{(i)} | x_{k-1}^{(i)}, z_k) = α π(x_k^{(i)} | z_k) + β p(x_k | x_{k-1}^{(i)}) + (1 − α − β) π_0(x_k),    (3)

with α, β ∈ [0, 1]. Besides the importance function, the measurement function involves visual cues which must be persistent but are more prone to ambiguity in cluttered scenes. An alternative is to consider multi-cue fusion in the weighting stage. Given L measurement sources (z_k^1, ..., z_k^L) and assuming the latter are mutually independent conditioned on the state, the unified measurement function can then be factorized as follows:

p(z_k^1, ..., z_k^L | x_k^{(i)}) ∝ Π_{l=1}^{L} p(z_k^l | x_k^{(i)}).    (4)

7 For "Support Vector Machine".
8 Indexes k and i are omitted for the sake of clarity and space.
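The following sketch illustrates Eqs. (3) and (4): a particle is drawn from the three-component mixture proposal, and its weight uses the factorized multi-cue likelihood. All function names are placeholders and this is not the paper's implementation, only a minimal example of the two formulas.

```python
import numpy as np

def sample_proposal(x_prev, z, alpha, beta, sample_detector, sample_dynamics,
                    sample_prior, rng):
    """Draw one particle from q = alpha*pi(x|z) + beta*p(x|x_prev) + (1-alpha-beta)*pi0(x).

    Sketch of Eq. (3); the three samplers are placeholders for the
    detector-based, dynamics-based and prior-based proposals.
    """
    u = rng.uniform()
    if u < alpha:
        return sample_detector(z)           # detector-driven (measurement) proposal
    elif u < alpha + beta:
        return sample_dynamics(x_prev)      # dynamics-driven proposal
    return sample_prior()                   # prior over the state space

def fused_likelihood(x, measurements, cue_likelihoods):
    """Eq. (4): product of per-cue likelihoods, assuming conditional independence."""
    p = 1.0
    for z_l, lik_l in zip(measurements, cue_likelihoods):
        p *= lik_l(z_l, x)
    return p
```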

4.2. Tracking implementation

The aim is to fit the template relative to the RFID-tagged person all along the video stream through the estimation of his/her head image coordinates (u, v) and scale factor s. All these parameters are accounted for in the state vector x_k related to the kth frame. With regard to the dynamics p(x_k|x_{k-1}), the image motions of humans are difficult to characterize over time. This weak knowledge is modeled by defining the state vector as x_k = [u_k, v_k, s_k]' and assuming that its entries evolve according to mutually independent random walk models, viz. p(x_k|x_{k-1}) = N(x_k; x_{k-1}, Σ), where N(·; μ, Σ) is a Gaussian distribution with mean μ and covariance Σ = diag(σ_u², σ_v², σ_s²).

In both the importance sampling and weight update steps, fusing multiple cues enables the tracker to better benefit from distinct information and decreases its sensitivity to temporary failures in some of the measurement processes. The underlying unified likelihood in the weighting stage is more or less conventional. It is computed thanks to (4) by means of several measurement functions, based on persistent visual cues, namely: (i) edges to model the silhouette [25] and (ii) multiple color distributions to represent the person's appearance (both head and torso) [34]. Despite its simplicity, our measurement function is inexpensive while still providing some level of person discrimination from the clothes' appearance. Otherwise, our importance function is unique in the literature and is detailed below.
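A minimal sketch of this tracking state and weighting stage is given below. The random-walk standard deviations are arbitrary placeholders, and the Bhattacharyya-style color score is only one common way (in the spirit of [34]) to turn histogram similarity into a likelihood; the paper does not spell out its exact measurement functions, so edge_likelihood and extract_hists remain placeholders here.

```python
import numpy as np

SIGMA = np.array([8.0, 8.0, 0.05])   # assumed std-devs for the (u, v, s) random walks

def propagate(x_prev, rng):
    """Random-walk dynamics p(x_k | x_{k-1}) = N(x_{k-1}, diag(sigma_u^2, sigma_v^2, sigma_s^2))."""
    return x_prev + rng.normal(0.0, SIGMA)

def color_likelihood(ref_hist, cand_hist, lam=20.0):
    """Illustrative colour cue: Bhattacharyya similarity between the reference and
    candidate histograms, turned into a likelihood."""
    rho = np.sum(np.sqrt(ref_hist * cand_hist))
    return np.exp(-lam * (1.0 - rho))

def weight(x, image, ref_hists, edge_likelihood, extract_hists):
    """Unified likelihood of Eq. (4): edge (silhouette) cue times colour cues."""
    w = edge_likelihood(image, x)
    for ref, cand in zip(ref_hists, extract_hists(image, x)):
        w *= color_likelihood(ref, cand)
    return w
```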

4.3. Importance function based on visual and RF cues

Given Eq. (2), three functions π(x_k|z_k^c), π(x_k|z_k^s) and π(x_k|z_k^r), respectively based on the skin probability image, the face detector and RF identification, are considered.
The importance function π(x_k|z_k^c) at location x_k = (u, v) is described by π(x|z^c) = h(c_z(x)), where c_z(x) is the color of the pixel located at x in the input image z^c. h is the 3D normalized histogram used for backprojection [30], indexed by the R, G, B channels, which represents the a priori learnt color distribution of the skin.

The function π(x_k|z_k^s) relies on a probabilistic image based on the well-known face detector pioneered by Viola et al. [41], and improved in [42,44], which covers a range of 45° of out-of-plane rotation. Let N_B be the number of detected faces {F_j}, j = 1, ..., N_B, and p_j = (u_j, v_j), j = 1, ..., N_B, the centroid coordinates of each such region. The face recognition technique, detailed in [18], involves two steps during the learning stage. The first one is composed of a PCA-based computation and multi-class SVM⁷ learning, while the second one uses a genetic algorithm for free-parameter optimization based on NSGA-II. Finally, our on-line decision rule defines a posterior probability P(C_t|F, z) of labeling face F_j as C_t, so that:

∀t, P(C_t|F, z) = 0 and P(C_∅|F, z) = 1,   when ∀t, L_t < τ,
∀t, P(C_t|F, z) = L_t / Σ_p L_p and P(C_∅|F, z) = 0,   otherwise,

where C_∅ refers to the void class, τ is one of the free parameters of the system and C_t refers to the face basis of the RFID-tagged person. The function π(·) at location x = (u, v) is deduced using a weighted Gaussian mixture proposal.⁸ Its expression is given hereafter:

π(x | z^s) ∝ Σ_{j=1}^{N_B} P(C | F_j, z) · N(x; p_j, diag(σ_{u_j}², σ_{v_j}²)),

where P(C|F_j, z) is the face ID probability for each detected face F_j given the beforehand learnt face of the tracked person. Given the RF outputs, the function π(x_k^{(i)}|z_k^r) is expressed as follows:

π(x_k^{(i)} | z_k^r) = N(θ_{x_k^{(i)}}; μ_θtag, σ_θtag),

where θ_{x_k^{(i)}} is the azimuthal position of the particle x_k^{(i)} in the robot frame, deduced from its horizontal position in the image and the camera pan angle. μ_θtag and σ_θtag, described in Section 3, are respectively the mean and the standard deviation of the estimated position of the sole targeted tag associated to the user in the robot frame, depending on the antenna outputs.
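The sketch below illustrates how the three importance maps and their mixture of Eq. (2) could be computed over the image. The histogram quantization, the pinhole column-to-azimuth mapping and the mixture weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def skin_importance(image_rgb, skin_hist, bins=32):
    """pi(x|z^c): backprojection of a learnt, normalised RGB skin histogram (cf. [30])."""
    idx = (image_rgb // (256 // bins)).astype(int)
    return skin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]

def face_importance(shape, faces):
    """pi(x|z^s): weighted Gaussian mixture centred on detected/recognised faces.

    faces: list of (u_j, v_j, sigma_u, sigma_v, P(C|F_j, z)) tuples.
    """
    v, u = np.mgrid[0:shape[0], 0:shape[1]]
    out = np.zeros(shape)
    for (uj, vj, su, sv, p_id) in faces:
        out += p_id * np.exp(-0.5 * (((u - uj) / su) ** 2 + ((v - vj) / sv) ** 2))
    return out

def rfid_importance(shape, pan_angle, mu_theta, sigma_theta, fov=1.0):
    """pi(x|z^r): Gaussian on the particle azimuth deduced from its image column."""
    u = np.arange(shape[1])
    theta = pan_angle + (u / shape[1] - 0.5) * fov     # assumed pinhole mapping
    col = np.exp(-0.5 * ((theta - mu_theta) / sigma_theta) ** 2)
    return np.tile(col, (shape[0], 1))

def mixed_importance(maps, kappas):
    """Eq. (2): pi(x|z^1..z^L) = sum_l kappa_l * pi(x|z^l), with sum(kappa) = 1."""
    return sum(k * (m / (m.sum() + 1e-12)) for k, m in zip(kappas, maps))
```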

The particle sampling is done using the importance function q(·) given in Eq. (3) and requires a rejection sampling process. This process constitutes an alternative when q(·) is not analytically modeled. The principle is described in Algorithm 2, with g(·) an instrumental distribution used to make the sampling easier, under the restriction that q(·) < M g(·), where M > 1 is an appropriate bound on q(·)/g(·).

Fig. 5. (a) Original image, (b) skin probability image π(x_k|z_k^c), (c) face detection π(x_k|z_k^s), (d) azimuthal angle from RFID detection π(x_k|z_k^r), (e) unified importance function (3) (without dynamics), (f) accepted particles (yellow dots) after rejection sampling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Algorithm 2. Rejection sampling algorithm

1: draw x_k^{(i)} according to g(x_k)
2: r ← q(x_k | x_{k-1}, z_k) / (M g(x_k^{(i)}))
3: draw u according to U[0, 1]
4: if u ≤ r then
5:   accept x_k^{(i)}
6: else
7:   reject it
8: end if
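A direct transcription of Algorithm 2 might look as follows; the max_tries guard is an added safety net for the sketch, not part of the original algorithm.

```python
def rejection_sample(q_pdf, g_sample, g_pdf, M, rng, max_tries=1000):
    """Draw one sample from q(.) via rejection sampling (cf. Algorithm 2).

    Sketch under the stated condition q(x) <= M*g(x) with M > 1;
    g is an instrumental distribution that is easy to sample from.
    """
    for _ in range(max_tries):
        x = g_sample()                       # step 1: draw from g
        r = q_pdf(x) / (M * g_pdf(x))        # step 2: acceptance ratio
        if rng.uniform() <= r:               # steps 3-5: accept with probability r
            return x
    return x                                 # fallback: return the last draw
```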

9 The distance d_tag provided by the RFID system is rather inaccurate.

Fig. 5 shows an illustration of the rejection sampling algorithm for a given image. Our importance function (3), combined with rejection sampling, ensures that the particles will be placed in the relevant areas of the state space, i.e. concentrated on the tracked person or on potential candidate areas.

5. A sensor-based control law for person following task

Now, we address the problem of making the robot follow the tagged person. To this aim, we use the data provided by both the tracker and the RFID system. We first briefly present the considered robotic system and the chosen control strategy, before detailing the different designed control laws.

5.1. Modeling the problem: the robot and the control strategy

Our robot Rackham, depicted in Section 6, consists of a nonholonomic mobile base equipped with an RFID system and with a camera mounted on a pan/tilt unit (PTU). Four control inputs can then be used to act on our robot: the linear and angular mobile base velocities (v_r, ω_r) and the pan/tilt unit angular velocities (ω_p, ω_t). Our goal is to compute these four velocities so that the robot can efficiently and safely achieve the person following. Different control strategies are available in the literature. In our case, where the camera and the RFID tag are used to detect and track the user, it appears rather natural to consider visual servoing techniques [15,13]. These techniques allow to control a large panel of robotic systems using image data provided by one (or several) cameras. However, although they are applied in a wide range of applications, the literature reports only few works which address the problem of human-based multi-sensor servoing.

Here, we focus on this problem and our idea is to use both RFID and tracker data to build our control laws. We have chosen to separately design the necessary controllers to decouple at best the different degrees of freedom of the camera. Analyzing the robot structure shows that the two control inputs ω_r and ω_p have the same effect on the feature movements in the image. Although this property can be used to perform additional objectives, such as obstacle detection and avoidance, here we have simply chosen to fix ω_p to zero, so that we control the features' horizontal position in the image using a unique controller. Moreover, using ω_r instead of ω_p allows to orientate the whole robotic system (and not only the camera) towards the targeted person, improving the task execution.

5.2. Control design

Thus, we aim at designing three controllers (v_r, ω_r, ω_t) to orientate the camera and to move the robot, so that the person to be followed is always kept in its line of sight and at a suitable social distance from the vehicle. To this aim, we use the data provided by the tracker, namely the coordinates of the head gravity center in the image (u_gc, v_gc) and its associated scale s_gc, which coarsely characterizes the H/R distance. From these data, we define the error function E_ptv to be decreased to zero:

E_ptv = (E_u, E_v, E_s)^T = (u_gc − u_i, v_gc − v_i, s_gc − s_i)^T,

where u_i and v_i represent the image center coordinates, and s_i the predefined scale corresponding to an acceptable social distance denoted by d_follow. E_u represents the abscissa error in the image and can then be regulated to zero by acting on the robot angular speed ω_r. E_v corresponds to the ordinate error in the image and can be decreased to zero thanks to the PTU tilt velocity ω_t, and E_s is the scale error regulated to zero using the robot linear velocity v_r. We design three PID controllers as follows:

ω_r = K_pp E_u + K_ip ∫ E_u dt + K_dp dE_u/dt,
ω_t = K_pt E_v + K_it ∫ E_v dt + K_dt dE_v/dt,    (5)
v_r = K_pv E_s + K_iv ∫ E_s dt + K_dv dE_s/dt.

The control gains (K_pp, K_ip, K_dp) (respectively (K_pt, K_it, K_dt) and (K_pv, K_iv, K_dv)) are experimentally tuned to ensure the system stability.
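For illustration, the three controllers of Eq. (5) can be written as three independent discrete PID loops, as sketched below. The gains, saturations and sign conventions are placeholders (the paper only states that the gains were tuned experimentally); the saturations merely echo the speed limits mentioned in Section 6.2.

```python
class PID:
    """Minimal discrete PID; gains and saturation values are illustrative."""
    def __init__(self, kp, ki, kd, out_max):
        self.kp, self.ki, self.kd, self.out_max = kp, ki, kd, out_max
        self.integral, self.prev_err = 0.0, None

    def step(self, err, dt):
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        out = self.kp * err + self.ki * self.integral + self.kd * deriv
        return max(-self.out_max, min(self.out_max, out))

# One controller per error component of Eq. (5); all numeric values are
# placeholders, not the experimentally tuned gains of the paper.
pid_wr = PID(0.004, 0.0005, 0.001, 0.6)   # E_u -> robot angular speed w_r
pid_wt = PID(0.004, 0.0005, 0.001, 0.5)   # E_v -> PTU tilt speed w_t
pid_vr = PID(0.8,   0.05,   0.1,   0.4)   # E_s -> robot linear speed v_r

def vision_based_control(u_gc, v_gc, s_gc, u_i, v_i, s_i, dt):
    """Compute (v_r, w_r, w_t) from the tracker outputs (Eq. (5)).
    Signs depend on the image and scale conventions and must be tuned."""
    w_r = pid_wr.step(u_gc - u_i, dt)
    w_t = pid_wt.step(v_gc - v_i, dt)
    v_r = pid_vr.step(s_gc - s_i, dt)
    return v_r, w_r, w_t
```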

However, these control laws can be used only when the target lies in the image. When the latter is lost, they cannot be applied anymore and we use the RFID information, namely the distance d_tag and the orientation θ_tag, to control the robot. The idea is then to make the camera turn until the robot faces the tag, so that the tracker can retrieve the tagged person in the camera view field. The corresponding robot behavior is shown in Fig. 6. To this aim, we simply impose a constant value ω_r⁰ for the robot angular velocity ω_r. The PTU speed ω_t is controlled so that the corresponding angle is brought back to its reference position, that is, the position reached after each initialization of the PTU. We finally impose a linear velocity whose value depends on d_tag and on θ_tag. In this way, we try⁹ to keep on satisfying the constraint on the social distance d_follow, despite the visual information loss. The robot is then kept in the closest possible user's neighborhood to ease the visual signal recovery. When the person is detected anew, the control strategy switches back to the three vision-based controllers given above. Note that the control law smoothness is preserved when the switch between the vision-based controller and the RFID-based controller occurs. Indeed, the linear velocity is progressively modified to reach the new desired value. As for ω_r and ω_t, the continuity is not explicitly handled because the robot and the PTU are sufficiently rapid systems.
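The switching behaviour described above can be sketched as follows; the constants (turn rate, velocity gain, ramp increment) are illustrative, and vision_cmd stands for the output of the three PID controllers of Eq. (5).

```python
def follow_step(tracker_ok, vision_cmd, rfid_out, state, dt,
                w0_r=0.3, d_follow=2.0, k_v=0.3, ramp=0.05, k_ptu=0.5):
    """Switch between the vision-based and RFID-based behaviours (Section 5 sketch).

    vision_cmd: (v_r, w_r, w_t) from the PID controllers when the target is
    in the image; rfid_out: (d_tag, theta_tag) otherwise. All constants here
    are placeholders, not the paper's tuned values.
    """
    if tracker_ok:
        v_des, w_r, w_t = vision_cmd
    else:
        d_tag, theta_tag = rfid_out
        w_r = w0_r if theta_tag > 0 else -w0_r    # constant turn towards the tag side
        w_t = -k_ptu * state['ptu_tilt']          # bring the PTU back to its reference
        v_des = k_v * (d_tag - d_follow)          # keep roughly d_follow to the tag
    # The linear velocity is modified progressively to keep the switch smooth
    state['v_r'] += max(-ramp, min(ramp, v_des - state['v_r']))
    return state['v_r'], w_r, w_t
```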

Fig. 6. RFID based robot behavior.

Fig. 7. Rackham.

Fig. 8. A typical run without human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



6. Integration and live experiments

6.1. Rackham description and software architecture

Rackham is an iRobot B21r mobile platform. Its standard equipment has been extended with one digital camera mounted on a Directed Perception pan-tilt unit, one ELO touch-screen, a pair of loudspeakers, an optical fiber gyroscope, wireless Ethernet and the RFID system previously described in Section 3 (Fig. 7). All these devices enable Rackham to act as a service robot in utilitarian public areas. It embeds robust Human–Robot interaction abilities and efficient basic navigation skills.

We have developed three software modules, namely ICU, which stands for "I see you", RFID and Visuserv, which respectively encapsulate human recognition/tracking, RFID localization, and visual servoing. These modules have been implemented within the "LAAS" architecture [1] using a C/C++ interfacing scheme. The OpenCV library¹⁰ is used for low-level feature extraction, e.g. edge or face detection. The entire system operates at an average framerate of 6 Hz.

10 See http://sourceforge.net/projects/opencvlibrary/.

6.2. Experiments and discussion

Experiments were conducted in our crowded robotic hall (4 × 5 m²). The goal is to control Rackham to follow a tagged person while respecting his/her personal space. The task requirements are: (i) the user must always be centered in the image and (ii) a distance d_follow = 2 m must be maintained between the tagged person and the robot. Up to now, the robot's obstacle avoidance abilities are rather coarse. Thus, we have chosen to set the robot's maximum driving and turning speeds to reduced values (respectively 0.4 m/s and 0.6 rad/s) compatible with most targeted users' velocities.¹¹ Numerous series of 10 runs have been carried out. The scenario is as follows: a non-expert person enters the hall, picks up an RFID tag on Rackham, then moves in the hall without paying attention to the robot. The system performances are evaluated on the whole experiment set and measured by:

– The visual contact rate (VCR), defined as the ratio of the frames where the targeted person was in the field of view over the total number of frames. This indicator indirectly measures the tracker's robustness to artifacts such as occlusions and sporadic target losses due to the crowds.

– The following error (FE), defined by E_follow = r − d_follow, where r is the current range to the tagged person. This error measures the robot's capability to follow the target at the desired distance d_follow.
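Both indicators are straightforward to compute from logged run data, e.g. as in the small sketch below (the averaging of the absolute error is an assumption; the paper only reports an average following error).

```python
def visual_contact_rate(in_view_flags):
    """VCR: fraction of frames in which the targeted person is in the view field."""
    return sum(in_view_flags) / float(len(in_view_flags))

def average_following_error(ranges, d_follow=2.0):
    """FE per frame is E_follow = r - d_follow; here averaged in absolute value."""
    errs = [r - d_follow for r in ranges]
    return sum(abs(e) for e in errs) / len(errs)
```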

Fig. 8 shows a typical run where the user is alone with the robot. During this 4-step run, either the sole vision or its multimodal counterpart is considered. In such nominal conditions, Fig. 9 shows snapshots of the video stream with superimposed tracking outputs as well as the current RFID saliency map.

11 If the user outpaces the robot and is lost, the vehicle is stopped for safety's sake.

Fig. 9. Snapshots of a trial. Notice that the pan-tilt unit azimuthal position is given by the red arch on the RFID map. The blue and green squares respectively depict the face detection (person gazing at the camera) and the MMSE estimate, while the yellow dots represent the particles before the resampling step. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

[Fig. 10 panels (a)–(f): time histories over the run of the control inputs v_r, ω_r and ω_t, the RFID measurements d_tag and θ_tag, and the ICU and RFID detection flags.]

Fig. 10. Synchronization of the data flow outputs between the different modules.



Fig. 10 shows the following signals issued from the modules ICU, RFID, and Visuserv, namely: (i) the two flags ICU and RFID, which are respectively set to 1 when the tagged user is detected either in the image or in the RFID area; (ii) the angle θ_tag and the distance d_tag; and (iii) the three control inputs (v_r, ω_r, ω_t) computed by the Visuserv module and sent to the robot.

Fig. 11. A typical run with human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. The black arrows represent the passers-by paths. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



After the mission initialization (a), the tracker focuses on the tagged person thanks to the video stream (b). The four steps of the scenario are then executed. Between points #1 and #2, the contact with the target is maintained thanks to the vision system. The control law is then computed using the visual data provided by the tracker. The target is centered in the image while the robot adjusts its relative distance thanks to the visual template scale. Between points #2 and #4 (c), the target disappears from the camera view field, which induces a visual tracker failure. The control law is then computed using the RFID data (d_tag, θ_tag) to make the robot face the targeted person, and converge towards him/her until d_tag reaches a close neighborhood of d_follow. The person following task is then still executed despite the visual target loss, while the RFID system triggers the camera in order to recover the target in the view field (d). Face detection/recognition allows to re-initialize the visual tracker (e) while the person goes back to #1 (f).

Fig. 12. Snapshots of a run in crowds. The first line shows the current human–robot situation.

Table 1
Visual contact rate when considering 1–4 passers-by (μ: mean, σ: standard deviation).

Sensor system    Number of passers-by
                 1            2            3            4 and more   Total
                 μ     σ      μ     σ      μ     σ      μ     σ      μ     σ
Vision only      0.21  0.11   0.22  0.02   0.18  0.05   0.22  0.06   0.21  0.04
Vision + RFID    0.94  0.08   0.85  0.14   0.94  0.13   0.83  0.19   0.86  0.14


During this path, the robot trajectory crosses the target's one. As expected, to preserve the social distance, it moves backward. In nominal conditions, the task is successfully performed. The average following error in this set of runs was 0.08 m, without real impact of the RFID system.

In a second step, we have progressively increased the number of people in the robot's vicinity to disturb the person following task by sporadically occluding the tagged person. Fig. 11 shows a run including typical situations that may happen during tracking. Other people cross the path between Rackham and the tagged person, walk together with him and then cross the path again, or even walk right behind Rackham for a long time, making the robot stop and the tracker re-initialize.

Fig. 12 shows snapshots of this run, while the entire video is available at the URL http://www.laas.fr/~tgerma/CVIU. In almost all cases, the multimodal tracker was able to cope with such adverse conditions, while the vision-only system fails to reset on the correct person. The visual system succeeded in 12% of the missions, while more than 85% of them were successfully performed using the multimodal counterpart. Table 1 shows the associated visual contact rates when increasing the number of passers-by.

The average Visual Contact Rate remains almost constant for both systems, but these results highlight the multimodal tracker's efficiency. In fact, the RFID system allows to keep the target in the visual view field for more than μ = 85% of the duration of the video stream despite the presence of more than four passers-by. The high value of the standard deviation (noted σ) is mainly due to the random motions of the passers-by, as they were asked to freely walk around the robot. During these missions the average following error was E_follow = 0.10 m.

7. Conclusion

Tracking provides important capabilities for human–robot interaction and assistance of humans in utilitarian populated spaces. The paper exhibits three contributions. The first contribution concerns the customization of an off-the-shelf RFID system to detect tags within a 360° view field and to coarsely estimate their distance thanks to the multiplexing of eight antennas. A second contribution concerns the development of a multimodal person tracker which combines the accuracy advantages of monocular active vision with the identification certainty of such an RFID-based sensor. Our technique uses the ICONDENSATION scheme, heterogeneous data-driven proposals and a rejection sampling mechanism to (re)-concentrate the particles on the right person during the sampling step. To the best of our knowledge, such multimodal data fusion within PF is unique in the robotics and vision literature. A third contribution concerns the embedding of this multimodal person tracker on a mobile robot and its coupling to the robot control to follow a tagged person in crowds. The live experiments have demonstrated that the person following task inherits the advantages of both sensor types, thereby being able to robustly track people and identify them.

Several directions are currently studied regarding the whole system. First, we will design more compact antennas, as embeddability is essential for autonomous robots. Second, the obstacle detection and avoidance problem will be addressed through an enhanced multi-sensor-based control strategy. Further investigations will also concern the algorithm's extension to multiple persons, as several RFID tags can be detected at the same time by the reader. The robot will then be able to interact with multiple humans simultaneously, to track and avoid passers-by, and even to interpret human–human interaction in its vicinity.

Acknowledgments

The authors are very grateful to Léo Bernard and Antoine Roguez for their involvement in this work, which was partially conducted within the EU STREP Project CommRob funded by the European Commission Division FP6 under Contract FP6-045441.

References

[1] R. Alami, R. Chatila, S. Fleury, F. Ingrand, An architecture for autonomy, International Journal of Robotic Research (IJRR'98) 17 (4) (1998) 315–337.
[2] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: International Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, USA, June 2008.
[3] M. Anne, J. Crowley, V. Devin, G. Privat, Localisation intra-bâtiment multi-technologies: RFID, wifi et vision, in: National Conference on Mobility and Ubiquity Computing (UbiMob'05), Grenoble, France, June 2005, pp. 29–35.
[4] K. Arras, O. Mozos, W. Burgard, Using boosted features for detection of people in 2D range scans, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007.
[5] S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking, Transactions on Signal Processing 50 (2) (2002) 174–188.
[6] N. Bellotto, H. Hu, Vision and laser data fusion for tracking people with a mobile robot, in: International Conference on Robotics and Biomimetics (ICRB'06), Kunming, China, December 2006.
[7] H.J. Bohme, T. Wilhelm, J. Key, C. Schauer, C. Schroter, H.M. Gross, T. Hempel, An approach to multi-modal human–machine interaction for intelligent service robots, Robotics and Autonomous Systems 44 (1) (2003) 83–96.
[8] M. Bregonzio, M. Taj, A. Cavallaro, Multimodal particle filtering tracking using appearance, motion and audio likelihoods, in: International Conference on Image Processing (ICIP'07), San Antonio, USA, October 2007.
[9] L. Brèthes, F. Lerasle, P. Danès, M. Fontmarty, Particle filtering strategies for data fusion dedicated to visual tracking from a mobile robot, Machine Vision and Applications (MVA'08) (2008), doi:10.1007/s00138-008-0174-7.
[10] D. Calisi, L. Iocchi, R. Leone, Person following through appearance models and stereo vision using a mobile robot, in: International Conference on Computer Vision Theory and Applications (VISAPP'07), Barcelona, Spain, March 2007.
[11] B. Castano, M. Rodriguez, An artificial intelligence and RFID system for people detection and orientation in big surfaces, in: International Multi-Conference on Engineering and Technological Innovation (IMETI'08), Orlando, USA, June 2008.
[12] G. Cielniak, A. Lilienthal, T. Duckett, Improved data association and occlusion handling for vision-based people tracking by mobile robots, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007.
[13] P.I. Corke, Visual Control of Robots: High Performance Visual Servoing, Research Studies Press Ltd., 1996.
[14] J. Cui, H. Zha, H. Zhao, R. Shibasaki, Multimodal tracking of people using laser scanners and a video camera, Image and Vision Computing (IVC'08) 26 (2) (2008) 240–252.
[15] B. Espiau, F. Chaumette, P. Rives, A new approach to visual servoing in robotics, IEEE Transactions on Robotics and Automation 8 (3) (1992) 313–326.
[16] T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and Autonomous Systems (RAS'03) 42 (2003) 143–166.
[17] D.M. Gavrila, Multi-cue pedestrian detection and tracking from a moving vehicle, International Journal of Computer Vision (IJCV'07) 73 (1) (2008) 41–59.
[18] T. Germa, F. Lerasle, T. Simon, Video-based face recognition and tracking from a robot companion, International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI'09) 23 (March) (2009) 591–616.
[19] R. Gockley, J. Forlizzi, R. Simmons, Natural person-following behavior for social robots, in: International Conference on Human Robot Interaction (HRI'07), Washington, USA, March 2007, pp. 17–24.
[20] N.J. Gordon, D.J. Salmond, A.F.M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, Radar and Signal Processing, IEE Proceedings F 140 (2) (1993) 107–113.
[21] D. Hahnel, W. Burgard, D. Fox, K. Fishkin, M. Philipose, Mapping and localization with RFID technology, in: International Conference on Robotics and Automation (ICRA'04), April 2004, pp. 1015–1020.
[22] R. Hammoud, J. Davis, Advances in vision algorithms and systems beyond the visible spectrum, Computer Vision and Image Understanding (CVIU'07) 106 (2) (2007) 145–147.
[23] C. Huang, H. Ai, Y. Li, S. Lao, High-performance rotation invariant multi-view face detection, Transactions on Pattern Analysis and Machine Intelligence (PAMI'07) 29 (4) (2007) 671–686.
[24] T. Ikeda, H. Ishiguro, T. Nishimura, People tracking by cross modal association of vision and acceleration sensors, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007, pp. 4147–4151.
[25] M. Isard, A. Blake, CONDENSATION – conditional density propagation for visual tracking, International Journal on Computer Vision 29 (1) (1998) 5–28.
[26] M. Isard, A. Blake, I-CONDENSATION: unifying low-level and high-level tracking in a stochastic framework, in: European Conference on Computer Vision (ECCV'98), Freibourg, Germany, June 1998, pp. 893–908.
[27] T. Kanda, M. Shiomi, L. Perrin, T. Nomura, H. Ishiguro, N. Hagita, Analysis of people trajectories with ubiquitous sensors in a science museum, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007, pp. 4846–4853.
[28] B. Kar, S. Bhatia, P. Dutta, Audio-visual biometric based speaker identification, in: International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'07), Sivakasi, India, December 2007, pp. 94–98.
[29] M. Kobilarov, G. Sukhatme, J. Hyams, P. Batavia, People tracking and following with mobile robot using an omnidirectional camera and laser, in: International Conference on Robotics and Automation (ICRA'06), Orlando, USA, May 2006, pp. 557–562.
[30] J. Lee, W. Lee, D. Jeong, Object tracking method using back-projection of multiple color histogram models, in: International Symposium on Circuits and Systems (ISCAS'03), June 2003.
[31] T. Mori, Y. Suemasu, H. Noguchi, T. Sato, Multiple people tracking by integrating distributed floor pressure sensors and RFID system, in: International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October 2004, pp. 5271–5278.
[32] R. Muñoz Salinas, M. García-Silvente, R. Medina-Carnicer, Adaptive multi-modal stereo people tracking without background modelling, Journal of Visual Communication and Image Representation 19 (2) (2008) 75–91.
[33] K. Nickel, T. Gehrig, H. Ekenel, R. Stiefelhagen, J. McDonough, A joint particle filter for audio–visual speaker tracking, in: International Conference on Multimodal Interfaces (ICMI'05), Trento, Italy, 2005, pp. 61–68.
[34] P. Pérez, J. Vermaak, A. Blake, Data fusion for visual tracking with particles, Proceedings of the IEEE 92 (3) (2004) 495–513.
[35] S.S. Takahashi, J. Wong, M. Miyamae, A ZigBee-based sensor node for tracking people's locations, in: ACM International Conference, Sydney, Australia, May 2008, pp. 34–38.
[36] D. Schulz, W. Burgard, D. Fox, A. Cremers, Tracking multiple moving targets with a mobile robot using particle filters and statistical data association, in: International Conference on Robotics and Automation (ICRA'01), Seoul, Korea, May 2001.
[37] D. Schulz, D. Fox, J. Hightower, People tracking with anonymous and ID-sensors using Rao-Blackwellised particle filters, in: International Joint Conference on Artificial Intelligence (IJCAI'03), Acapulco, Mexico, August 2003.
[38] J. Smith, K. Fishkin, B. Jiang, A. Mamishev, RFID-based techniques for human-activity detection, Communications of the ACM 48 (9) (2005) 39–44.
[39] L. Spinello, R. Triebel, R. Siegwart, Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction, in: AAAI Conference on Artificial Intelligence (AAAI'08), Chicago, USA, July 2008, pp. 1409–1414.
[40] Y. Tsai, H. Shih, C. Huang, Multiple human objects tracking in crowded scenes, in: International Conference on Pattern Recognition (ICPR'06), Hong Kong, August 2006, pp. 51–54.
[41] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: International Conference on Computer Vision and Pattern Recognition (CVPR'01), 2001.
[42] P. Viola, M. Jones, Fast multi-view face detection, in: International Conference on Computer Vision and Pattern Recognition (CVPR'03), Madison, USA, June 2003.
[43] H. Wang, H. Lenz, A. Szabo, J. Bamberger, U. Hanebeck, WLAN-based pedestrian tracking using particle filters and low-cost MEMS sensors, in: Workshop on Positioning, Navigation and Communication (WPNC'07), Hannover, Germany, March 2007.
[44] Y. Wang, Y. Liu, L. Tao, G. Xu, Real-time multi-view face detection and pose estimation in video stream, in: 18th International Conference on Pattern Recognition (ICPR'06), 2006, vol. 4, pp. 354–357.
[45] L. Ying, S. Narayanan, C. Kuo, Adaptive speaker identification with audiovisual cues for movie content analysis, Pattern Recognition Letters 25 (7) (2004) 776–791.
[46] W. Zajdel, Z. Zivkovic, B. Kröse, Keeping track of humans: have I seen this person before?, in: International Conference on Robotics and Automation (ICRA'05), Barcelona, Spain, April 2005, pp. 2093–2098.
[47] Z. Zivkovic, B. Kröse, Part based people detection using 2D range data and images, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007.