
Three-Stage Model for Robust Real-Time Face Tracking

Jun-Hyeong Do, Zeungnam Bien

Division of Electrical Engineering, School of Electrical Engineering and Computer Science, KAIST, 373-1 Guseong-Dong, Yuseong-Gu, Daejeon, Korea

Received 19 May 2007; accepted 16 January 2008

Correspondence to: Jun-Hyeong Do; e-mail: [email protected]
Grant sponsors: MOST/KOSEF (SRC/ERC Program), Grant number R11-1999-008; The School of Information Technology, KAIST (Brain Korea 21 Project).
© 2008 Wiley Periodicals, Inc.

ABSTRACT: We propose a novel face tracking framework, the three-stage model, for robust face tracking against interruptions from face-like blobs. For robust face tracking in real time, we considered two critical factors in the construction of the proposed model. One factor is the exclusion of background information in the initialization of the target model, the extraction of the target candidate region, and the updating of the target model. The other factor is the robust estimation of face movement under various environmental conditions. The proposed three-stage model consists of a preattentive stage, an assignment stage, and a postattentive stage with a prerecognition phase. The model is constructed by means of effective integration of optimum cues that are selected in consideration of the trade-off between true positives and false positives of face classification based on a context-dependent type of categorization. The experimental results demonstrate that the proposed tracking method improves the performance of the real-time face tracking process in terms of success rates and robustness against interruptions from face-like blobs. © 2008 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 17, 321–327, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ima.20126

Key words: face tracking; multi-stage model; multiple cues; prerecognition phase; candidate blob

I. INTRODUCTION

In the process of human face tracking for a given video sequence, a face tracker often fails to track targets under various environmental conditions such as varying illumination and size of the tracked face, skin-colored background, abrupt motion of the face, and occlusion by other objects. The type of failure in which a tracker follows face-like objects instead of a face arises because the tracker lacks the capability of accurately distinguishing between faces and face-like objects.

A typical real-time face tracking process contains three phases: an initialization phase, in which a target model is initialized when a face is detected; a localization phase, in which the tracker finds the target candidate region most similar to the target model; and an update phase, in which the target model is updated in response to varying environmental conditions. To implement a face tracker that is robust in various environmental conditions and runs in real time, we considered two critical factors of the tracking process.

The first critical factor is that a technique is required for excluding as much background information as possible from the target model and the target candidate region, even if face-like objects are in the background. A target model and a target candidate region that contain background information recursively produce inaccurate target localization and inaccurate updates of the target model, thereby causing tracking failure. In particular, when the distributions of face features and background features are similar, the tracker may easily converge to a face-like object in the background. For example, color information, which is easy to implement and insensitive to variations of pose, expression, scale, and rotation, is used in most face tracking processes and yields a high recognition rate; however, if the tracker relies solely on a color cue, the performance can be degraded by other skin-tone objects. To complement this defect, several researchers have used multiple cues. Triesch and Malsburg (2001) integrated five different visual cues to detect and track a single face by calculating the weighted sum of each cue's observational probability. On the other hand, Cordea et al. (2001) and Wu and Huang (2004) combined the color cue and the face contour on the basis of edge detection to increase robustness and generality, though this method is time-consuming and often fails to track the face in a skin-colored complex background because of detection errors of the face contour. In another approach, Comaniciu et al. (2003) derived a simple representation of the background features and used it to neglect the background features in the target model and in the target candidate region extracted for the target localization phase. Yin and Collins (2006) also proposed a spatial divide-and-conquer method to discriminate the target from the background. These techniques are effective when the background is simple, but their performance is poor whenever some objects or elements of the background have features similar to the target's.

The second critical factor is the robust estimation of face movements. Even though a technique that discriminates between faces and nonfaces is applied to the face tracking process, the region classified as a face may contain some false positives, such as skin-colored moving objects and moving hands. Hence, without accurate estimation of face movements, the tracker may easily converge to a local maximum that corresponds to a false positive. Most real-time tracking processes predict face movements by using an autoregressive (AR) model that renders an approximate model of face motion on the basis of previous information about face movements, such as the position, velocity, and acceleration (Comaniciu et al., 2003; Shan et al., 2004; Yang et al., 2005). However, this type of motion model-based estimation has limitations with regard to accurate estimation of face motion: face motion in natural situations is likely to be nonlinear and to involve sudden accelerations, reversals of rotation, and changes in orientation. Moreover, the corresponding image plane projection makes the modeling of face motions even more complex. Because such target movement makes prediction difficult, conventional prediction techniques easily lose the target and often converge to a local maximum in the image where face-like objects have merged with the face.

In considering these two factors for robust face tracking, we propose a novel framework that effectively integrates multiple cues. In Section II, we discuss how to select and obtain optimum cues that maximize the true positives while minimizing the false positives for face classification. In Section III, we describe the three-stage model in detail. The experimental results, presented in Section IV, show the superior performance of the proposed model. Finally, we offer our conclusions in Section V.

II. MULTIPLE CUES FOR FACE TRACKING

A. Selection of Cues. To select the optimum cues for effective face tracking, we analyzed which types of false negatives, true positives, and false positives could be obtained when a face classifier that used each cue was applied to a given image.

For the tracking of live human faces, we assumed that a face could be well characterized by the three salient features of motion, skin color, and face-like appearance, which are known to be mutually independent and complementary (Yang et al., 2002). On the basis of this assumption, we regarded a given input image as a set of blobs that consists of faces (Fa), hands (Ha), moving objects (MO), and background blobs (Bg). Each type of blob is classified into sub-blobs with more specific properties, as depicted in Table I. Basically, we considered the hands, moving objects, and background blobs to have the skin color, motion, and motionless properties, respectively.

Table II shows the classification results for faces when binary classifiers that use each cue to characterize a face are applied to a given image. Face classification with each individual cue yields different false negatives, true positives, and false positives. It goes without saying that the largest number of true positives combined with the smallest number of false positives provides the best results, but in practice there is a trade-off between true positives and false positives, even when multiple cues are integrated. As a result, the optimum cues for face tracking are difficult to determine.

To prevent some of the face regions from being filtered out, we initially selected cues that yield the maximum number of true positives even if some false positives occur, as depicted in Table III. With these selected cues, there may be false positives such as Ha(x,1), MO(x,1), and Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)). With regard to these false positives, we describe in Section III.D a novel method that prevents the tracker from converging to such false positive blobs.

B. Effective Extraction of Selected Cues. Let the saliency map of a given cue, S_cue(x_i, k), be a two-dimensional map that expresses the saliency of the cue at location x_i of the kth frame, where 0 ≤ S_cue(x_i, k) ≤ 1. As the basis of the proposed model, the saliency map of each cue is obtained with respect to robustness under various environmental conditions.

Table I. Blob constitution in an image.

  Image: I_input = {Fa(a,m), Ha(a,m), MO(a,c), Bg(a,c)}

  Blobs                      Sub-blobs
  Faces: Fa(a,m)             {Fa(0,0), Fa(0,1), Fa(1,0), Fa(1,1), Fa(2,0), Fa(2,1)}
  Hands: Ha(a,m)             {Ha(0,0), Ha(0,1), Ha(1,0), Ha(1,1), Ha(2,0), Ha(2,1)}
  Moving objects: MO(a,c)    {MO(0,0), MO(0,1), MO(1,0), MO(1,1), MO(2,0), MO(2,1)}
  Background blobs: Bg(a,c)  {Bg(0,0), Bg(0,1), Bg(1,0), Bg(1,1), Bg(2,0), Bg(2,1)}

  Property of blobs:
  a (appearance): 0 = frontal face-like appearance, 1 = profile-like appearance, 2 = others, x = do not care
  m (motion):     0 = motionless, 1 = in motion, x = do not care
  c (color):      0 = nonskin, 1 = skin, x = do not care

Table II. Face classification using individual cues.

  Cue Used                          False Negative (FN)  True Positive (TP)     False Positive (FP)
  Appearance (frontal)              Fa(1,x), Fa(2,x)     Fa(0,x)                Ha(0,x), MO(0,x), Bg(0,x)
  Appearance (frontal and profile)  Fa(2,x)              Fa(0,x), Fa(1,x)       Ha(0,x), MO(0,x), Bg(0,x), Ha(1,x), MO(1,x), Bg(1,x)
  Skin color                        0                    Fa(x,x)                Ha(x,x), MO(x,1), Bg(x,1)
  Motion                            Fa(x,0)              Fa(x,1)                Ha(x,1), MO(x,x)
  Combination of cues               {Fa(x,x)} − TP       TP_cue1 ∩ TP_cue2 ∩ …  FP_cue1 ∩ FP_cue2 ∩ …

B.1. Skin Color. The saliency map of the skin color, S_s(x_i, k), is obtained by means of a generalized skin color model. For more accurate modeling of the skin color distribution, histogram models are used instead of a single Gaussian model or a Gaussian mixture model, which is known to be ineffective in modeling complex-shaped skin color distributions (Phung et al., 2005). To segment the general skin color blobs regardless of an individual's skin color, we constructed two-dimensional skin and nonskin color histogram models from the huge databases of skin and nonskin color images, respectively, gathered by Jones and Rehg (1998). Each model had a size of 256 bins per color channel and was constructed in the UV color space to be robust against illumination changes.

With the two color histograms, we can compute the probability of skin color for a given uv value on the basis of the Bayes rule as follows:

\[ P(s \mid uv) = \frac{P(uv \mid s)\,P(s)}{P(uv \mid s)\,P(s) + P(uv \mid \neg s)\,P(\neg s)} = \frac{s[uv]}{s[uv] + n[uv]}, \qquad (1) \]

where s[uv] and n[uv] are the numbers of pixels that fall in bin uv of the skin and nonskin color histograms, respectively. Then, from Eq. (1), we can calculate S_s(x_i, k) as follows:

\[ S_s(\mathbf{x}_i, k) = \begin{cases} 1 & \text{if } P(s \mid uv(\mathbf{x}_i, k)) > \alpha \text{ and } s[uv(\mathbf{x}_i, k)] > \beta, \\ 0 & \text{otherwise.} \end{cases} \qquad (2) \]

Here, α and β are threshold values.
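As an illustration, the per-pixel skin test of Eqs. (1) and (2) could be sketched as follows. This is a minimal sketch, not the authors' implementation; the histogram arrays, the bin indexing, and the values of alpha and beta are assumed placeholders.

```python
import numpy as np

def skin_saliency(frame_uv, skin_hist, nonskin_hist, alpha=0.4, beta=10):
    """Sketch of Eqs. (1)-(2): binary skin saliency map S_s.

    frame_uv     -- (H, W, 2) array of U and V values in [0, 255]
    skin_hist    -- 256x256 array s[uv] of skin pixel counts
    nonskin_hist -- 256x256 array n[uv] of nonskin pixel counts
    """
    u, v = frame_uv[..., 0], frame_uv[..., 1]
    s = skin_hist[u, v].astype(np.float64)     # s[uv] per pixel
    n = nonskin_hist[u, v].astype(np.float64)  # n[uv] per pixel
    # Eq. (1): P(s|uv) = s[uv] / (s[uv] + n[uv])
    p_skin = s / np.maximum(s + n, 1e-9)
    # Eq. (2): threshold on both the posterior and the raw skin count
    return ((p_skin > alpha) & (s > beta)).astype(np.uint8)
```

Indexing the 256 × 256 histograms directly with the U and V planes keeps the whole map vectorized, in line with the low-cost requirement stated above.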

B.2. Motion Blobs. A set of complete motion blobs is obtained by applying a closing operation to the motion pixels calculated from the frame difference, which allows the motion blobs to be acquired without initializing a scene model. The saliency maps of the motion pixels and motion blobs are sequentially computed as follows:

\[ S_{m\_p}(\mathbf{x}_i, k) = \begin{cases} 1 & \text{if } \left| I_{in\_i}(\mathbf{x}_i, k) - I_{in\_i}(\mathbf{x}_i, k-1) \right| > \gamma, \\ 0 & \text{otherwise} \end{cases} \qquad (3) \]

\[ S_{m\_b}(\mathbf{x}_i, k) = \mathrm{Closing}\left( S_{m\_p}(\mathbf{x}_i, k) \right), \qquad (4) \]

where I_in_i(k) is the intensity component of the input image I_in(k) at the kth frame and γ is a threshold value.
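A minimal sketch of Eqs. (3) and (4) using OpenCV follows; the threshold gamma and the structuring-element size are assumed values, not those used in the paper.

```python
import cv2
import numpy as np

def motion_blobs(gray_curr, gray_prev, gamma=15, kernel_size=5):
    """Sketch of Eqs. (3)-(4): motion pixels by frame differencing,
    then a morphological closing to form complete motion blobs."""
    # Eq. (3): S_m_p = 1 where the intensity change exceeds gamma
    s_m_p = (cv2.absdiff(gray_curr, gray_prev) > gamma).astype(np.uint8)
    # Eq. (4): S_m_b = Closing(S_m_p)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.morphologyEx(s_m_p, cv2.MORPH_CLOSE, kernel)
```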

B.3. Previous Face Location. The face blob under tracking is represented by an elliptical region with an aspect ratio of 1.4, where the minor axis h_x and the major axis h_y are equal to 0.86h and 1.2h, respectively, and h × h is the area of the detected face.

The saliency map of the previous location of the nth face blob at the kth frame, S_p_f^(n)(x_i, k), is defined as

\[ S_{p\_f}^{(n)}(\mathbf{x}_i, k) = \begin{cases} 1 & \text{if } S_s(\mathbf{x}_i, k) \cdot K\!\left( \frac{\mathbf{x}_i - \mathbf{y}_f^{(n)}(k-1)}{\mathbf{h}^{(n)}(k-1)} \right) > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (5) \]

where h^(n) = [h_x^(n), h_y^(n)]^T; y_f^(n)(k) is the center of gravity (COG) of the nth face at the kth frame; K is the Epanechnikov kernel; n = 0, …, N; and N is the number of faces being tracked.
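The masking in Eq. (5) could be sketched as below, with the Epanechnikov kernel evaluated on coordinates normalized by the ellipse axes; the helper function and its argument conventions are hypothetical.

```python
import numpy as np

def previous_face_saliency(s_s, cog, h_axes):
    """Sketch of Eq. (5): S_p_f is the skin saliency S_s masked by an
    elliptical Epanechnikov kernel around the previous face COG.

    s_s    -- (H, W) binary skin saliency map S_s(x, k)
    cog    -- (x, y) center of gravity y_f of the face at frame k-1
    h_axes -- (h_x, h_y) minor/major axis lengths at frame k-1
    """
    H, W = s_s.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Normalized squared distance; the kernel support is the ellipse r2 < 1
    r2 = ((xs - cog[0]) / h_axes[0]) ** 2 + ((ys - cog[1]) / h_axes[1]) ** 2
    kernel = np.maximum(1.0 - r2, 0.0)  # Epanechnikov profile (up to scale)
    return ((s_s > 0) & (kernel > 0)).astype(np.uint8)
```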

III. THREE-STAGE MODEL

A. Overall Configuration. According to studies by cognitive scientists, the mechanism of the focus of attention in biological vision systems is broadly composed of preattentive and postattentive stages (Neisser, 1967; Toyama and Hager, 1999). The preattentive stage rapidly finds regions of interest (ROIs) and excludes most non-ROIs; attention is then focused on the ROIs in the postattentive stage, where the attentive region is examined more closely. Inspired by this notion of attention, as shown in Figure 1, we propose a three-stage model that consists of a preattentive stage, an assignment stage, and a postattentive stage.

In the preattentive stage, face candidate blobs that contain both skin color and motion blobs are extracted, while most of the nonface regions are excluded at low computational cost. The assignment stage assigns to each tracked face the candidate blob with the largest overlap region on that face. The assignor prevents interference from other candidate blobs that are not related to the tracking of each face.

The tracking process for each face is then conducted on the assigned region and on the previous location of the tracked face, using the information about the face candidate blobs from the current and previous frames as well as the face color model.

B. Preattentive Stage. Among the selected cues for face tracking, the face candidate blobs that contain both skin color and motion blobs are extracted in the preattentive stage. To acquire the candidate blobs at low computational cost, we considered the priority of operations and the ROIs where the operations were to be applied.

Table III. Combination of cues for face tracking at the kth frame.

  Cue Used: (skin color AND motion) + (face location at the (k−1)th frame AND skin color)
  FN: 0    TP: Fa(x,x)    FP: Ha(x,1), MO(x,1), Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1))

Figure 1. The overall configuration of the proposed structure.

Figure 2. False positive errors in the face tracking process. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

By comparing the processing time required to obtain S_m_p, S_m_b, and S_s over the entire region, we found that the computational cost of obtaining S_m_b is higher than that of S_s or S_m_p because of the morphological operation needed to obtain the motion blobs. We therefore first obtained the two saliency maps S_s and S_m_p, and then reduced the processing regions for obtaining the motion blobs as follows:

\[ ROI_1 \equiv ROI\left( S_s \cap S_{m\_p} \right) \qquad (6) \]

\[ ROI_2 \equiv \bigcup_{r=1}^{R} \left( ROI\!\left( S_s^{(r)} \right) \text{ within } ROI_1 \right), \qquad (7) \]

where ROI(image) is the rectangular box region surrounding the foreground pixels in the given image, and R is the number of blobs within ROI_1 of S_s.

After applying ROI_2 to obtain S_m_b, we then acquired the saliency map of the face candidate blobs, S_c, and their ROIs as follows:

\[ S_c = \mathrm{Filtering}\left( S_{m\_b} \cap S_s \right) \text{ within } ROI_2 \qquad (8) \]

\[ ROI_c = \bigcup_{m=1}^{M} ROI\!\left( S_c^{(m)} \right), \qquad (9) \]

where S_m_b = Closing(S_m_p within ROI_2) and M is the number of face candidate blobs. Note that small blobs are eliminated by the filtering process.
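Combining Eqs. (6)–(9), one possible sketch of the preattentive cascade is shown below. It approximates the per-blob union of Eq. (7) by masking with ROI_1, and the minimum blob area is an assumed value.

```python
import cv2
import numpy as np

def preattentive_candidates(s_s, s_m_p, min_area=50):
    """Sketch of Eqs. (6)-(9): restrict the costly closing operation to
    skin regions that already contain motion pixels, then filter blobs."""
    overlap = cv2.bitwise_and(s_s, s_m_p)
    if cv2.countNonZero(overlap) == 0:
        return np.zeros_like(s_s)
    # ROI_1: box around all skin-and-motion pixels (Eq. 6)
    x, y, w, h = cv2.boundingRect(overlap)
    roi = np.zeros_like(s_s)
    roi[y:y + h, x:x + w] = 1
    # Closing applied only inside the reduced region (approximates ROI_2)
    s_m_b = cv2.morphologyEx(s_m_p * roi, cv2.MORPH_CLOSE,
                             np.ones((5, 5), np.uint8))
    # Eq. (8): candidate blobs = motion blobs AND skin, small blobs dropped
    s_c = cv2.bitwise_and(s_m_b, s_s)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(s_c)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < min_area:
            s_c[labels == i] = 0
    return s_c
```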

C. Assignment Stage. Among the candidate blobs extracted in the preattentive stage, one blob for the tracking of each face is selected by comparing the overlap ratio between the candidate blobs and each face blob of the previous frame.

The candidate blob selected for the tracking of the nth face at the kth frame is calculated as follows:

\[ Se(n,k) = \begin{cases} \arg\max_m \, Ov\!\left( S_c^{(m)}(k), \, S_{p\_f}^{(n)}(k) \right) & \text{if } \max_m \, Ov\!\left( S_c^{(m)}(k), \, S_{p\_f}^{(n)}(k) \right) \neq 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (10) \]

where

\[ Ov\!\left( S_c^{(m)}(k), S_{p\_f}^{(n)}(k) \right) = \frac{\mathrm{Area}\!\left( S_c^{(m)}(k) \cap S_{p\_f}^{(n)}(k) \right)}{\mathrm{Area}\!\left( S_{p\_f}^{(n)}(k) \right)} \qquad (11) \]

and Area(image) is the area of the foreground pixels in the given image.

After the candidate blobs have been assigned for face tracking, the unselected candidate blobs are then assigned for face detection.
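A sketch of the assignment rule of Eqs. (10) and (11) follows; representing each candidate blob as a boolean mask is an assumption of this sketch.

```python
import numpy as np

def assign_candidate(candidate_masks, s_p_f):
    """Sketch of Eqs. (10)-(11): pick the candidate blob whose overlap
    with S_p_f^(n), normalized by Area(S_p_f^(n)), is largest.

    candidate_masks -- list of (H, W) boolean masks, one per blob S_c^(m)
    s_p_f           -- (H, W) boolean mask of the previous face location
    Returns the selected blob index, or None if nothing overlaps.
    """
    area_face = s_p_f.sum()
    if area_face == 0:
        return None
    ratios = [np.logical_and(m, s_p_f).sum() / area_face
              for m in candidate_masks]
    best = int(np.argmax(ratios))
    return best if ratios[best] > 0 else None
```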

D. Postattentive Stage. According to the face classification results obtained with the optimum cues, as discussed in Section II.A, the face tracker can track all types of faces; but, as shown in Figure 2, it may also follow the blobs that correspond to the false positives, such as Ha(x,1), MO(x,1), and Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)), during the tracking.

The false positives by Ha(x,1) and MO(x,1) (Figs. 2a–2c) occur because of the use of the skin color and motion cues, which are contained in the candidate blobs S_c during the preattentive stage. Meanwhile, the Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)) blobs (Fig. 2d) occur because of the use of a combination of the face location in the previous frame and the skin color. To prevent the face tracker from converging to these two types of false positive blobs, tracking schemes are proposed in the prerecognition and localization phases, respectively.

D.1. Initialization of Target Model. When the nth face is newly detected by a detection process, the target model for the nth face tracking, q^(n), is initialized by the color probability distribution function of the target through the use of a U-bin histogram:

\[ \mathbf{q}^{(n)} = \left\{ q_u^{(n)} \right\}_{u=1 \ldots U}. \qquad (12) \]

As a technique to prevent the target model from containing background information, we applied the saliency map of the mth candidate blob assigned for the nth face detection to pick out only the skin color and motion pixels in the detected face region. Let the function b(x_i) be the index of the histogram bin associated with the pixel value at x_i. The probability of the uth bin of the nth target histogram model at the kth frame is then computed as

\[ q_u^{(n)}(k) = C_q^{(n)}(k) \sum_i S_c^{(m)}(\mathbf{x}_i, k) \cdot K\!\left( \frac{\mathbf{x}_i - \mathbf{y}_f^{(n)}(k)}{\mathbf{h}^{(n)}(k)} \right) \cdot \delta\!\left[ b(\mathbf{x}_i, k) - u \right], \qquad (13) \]

where δ is the Kronecker delta function and C_q^(n)(k) is the normalization constant.
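Equation (13) is a kernel-weighted color histogram accumulated only over candidate-blob pixels. The sketch below assumes a precomputed bin-index map b(x_i) and reuses the Epanechnikov profile from the Eq. (5) sketch.

```python
import numpy as np

def target_histogram(bin_map, s_c, cog, h_axes, n_bins):
    """Sketch of Eq. (13): q_u = C * sum_i S_c(x_i) K(...) delta[b(x_i) - u].

    bin_map -- (H, W) integer map b(x_i) of histogram bin indices
    s_c     -- (H, W) binary candidate-blob saliency map S_c^(m)
    cog     -- face center y_f; h_axes -- (h_x, h_y) ellipse axes
    """
    H, W = bin_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    r2 = ((xs - cog[0]) / h_axes[0]) ** 2 + ((ys - cog[1]) / h_axes[1]) ** 2
    # Kernel weight gated by the candidate-blob mask, so background pixels
    # inside the ellipse contribute nothing to the model
    weights = np.maximum(1.0 - r2, 0.0) * (s_c > 0)
    q = np.bincount(bin_map.ravel(), weights=weights.ravel(),
                    minlength=n_bins)
    total = q.sum()
    return q / total if total > 0 else q  # C_q normalizes the histogram
```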

D.2. Prerecognition of Target Movement. In this phase, the direction of face movement is estimated to prevent the tracker from converging to a local maximum that corresponds to a moving hand (Ha(x,1)) or to a skin-colored moving object (MO(x,1)) prior to the face localization phase.

The preattentive stage allows us to anticipate the nth face information of the current frame from the candidate blob assigned for the nth face tracking, before the tracking process is carried out in the localization phase. Thus, by using the face information of the current frame as well as the previous frame at low computational cost, the tracking process is capable of prerecognizing the nth face state of the current frame and estimating the face movement more accurately in real time than AR model-based prediction methods (Comaniciu et al., 2003; Shan et al., 2004; Yang et al., 2005).

Note that the false positives contained in S_c^(Se(n)), the saliency map of the Se(n)th candidate blob assigned for the nth face tracking, make it difficult to use S_c^(Se(n)) itself to estimate the face movement. However, by considering the variation of the COG of S_c^(Se(n)) between the previous frame and the current frame, we also note that a false positive blob may affect the estimation of the nth face movement in a negative way when it merges with the nth face blob (kth frame in Figs. 2a–2c) or when it demerges from the nth face ((k+1)th frame in Fig. 2a).

To minimize the negative effect due to variations of the skin-colored moving objects that are already merged with a face blob, as shown in Figures 2b and 2c, we initially specified the ROI for the nth face tracking at the kth frame, ROI_t^(n)(k), as a rectangular region that surrounds the nth face region of the (k−1)th frame with a small margin (α_h · h_x^(n)(k−1)), as follows:

\[ ROI_t^{(n)}(k) = ROI\!\left( S_{e\_f}^{(n)}(k-1) \right), \qquad (14) \]

where

\[ S_{e\_f}^{(n)}(\mathbf{x}_i, k) = \begin{cases} 1 & \text{if } K\!\left( \frac{\mathbf{x}_i - \mathbf{y}_f^{(n)}(k)}{\mathbf{h}^{(n)}(k) + \alpha_h \cdot h_x^{(n)}(k) \cdot [1\;\,1]^T} \right) > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (15) \]

and 0 < α_h < 1.

As for the false positive that occurs when a skin-colored moving object merges with the face blob at the kth frame, as shown in the (k−1)th and kth frames of Figure 2a, we observed that the COG of S_c^(Se(n,k)) moves toward the merged false positive blob compared with the COG of S_c^(Se(n,k−1)) of the (k−1)th frame. This movement may lead to a tracking failure in the sense that the tracker converges to the merged false positive blob in the localization phase. When the skin-colored moving object demerges from the face blob at the (k+1)th frame, as shown in the kth and (k+1)th frames of Figure 2a, on the other hand, the COG of S_c^(Se(n,k+1)) moves in the opposite direction from the moving hand, which makes the tracker converge to the face in the localization phase. Therefore, the pixels that are turned into foreground in the candidate region by the merged false positive blobs at the current frame should be ignored when calculating the movement of the COG of S_c^(Se(n)).

On the basis of the above properties, we conclude that the movement direction of the nth face at the kth frame can be prerecognized accurately if we ignore the newly appearing candidate region and consider the disappeared and remaining candidate regions in S_c^(Se(n,k))(k) within ROI_t^(n)(k), compared with S_c^(Se(n,k−1))(k−1) within ROI_t^(n)(k):

\[ Mo(n,k) = \alpha_m \left[ \mathrm{COG}\!\left( S_c^{(Se(n,k))}(k) \cap S_c^{(Se(n,k-1))}(k-1) \text{ within } ROI_t^{(n)}(k) \right) - \mathrm{COG}\!\left( S_c^{(Se(n,k-1))}(k-1) \text{ within } ROI_t^{(n)}(k) \right) \right], \qquad (16) \]

where α_m is a positive constant. Furthermore, if S_c^(Se(n,k−1))(k−1) = 0 or S_c^(Se(n,k))(k) = 0, then Mo(n,k) = 0.

The search window of the face localization phase then begins searching for the target at the region centered at

\[ \mathbf{y}_s^{(n)}(k) = \mathbf{y}_f^{(n)}(k-1) + Mo(n,k). \qquad (17) \]
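A sketch of Eqs. (16) and (17): intersecting the current and previous candidate maps inside ROI_t discards pixels newly turned on by a merging blob, as the text requires. Boolean masks and the gain alpha_m = 1 are assumptions of this sketch.

```python
import numpy as np

def prerecognized_motion(s_c_curr, s_c_prev, roi_mask, alpha_m=1.0):
    """Sketch of Eq. (16): estimate the face displacement Mo(n, k) from
    the COG shift of the remaining candidate pixels within ROI_t.
    All arguments are (H, W) boolean masks."""
    def cog(mask):
        ys, xs = np.nonzero(mask)
        return None if xs.size == 0 else np.array([xs.mean(), ys.mean()])
    prev = cog(s_c_prev & roi_mask)
    # Intersection ignores pixels that newly appeared at the current frame
    both = cog(s_c_curr & s_c_prev & roi_mask)
    if prev is None or both is None:
        return np.zeros(2)  # Mo = 0 when either map is empty
    return alpha_m * (both - prev)

# Eq. (17): the localization search then starts at y_f(k-1) + Mo(n, k).
```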

D.3. Localization of the Target. In the localization phase, the target candidate region is compared with the target model by using a similarity function based on the Bhattacharyya coefficient, and the candidate region that maximizes the similarity value is found by a mean shift procedure (Comaniciu et al., 2003).

The candidate region, like the target model, is characterized by the color probability distribution function of a U-bin histogram. The probability of the uth bin in the color histogram of the candidate region centered at y, for the nth face tracking at the kth frame, is calculated as follows:

\[ p_u^{(n)}(\mathbf{y}, k) = C_p^{(n)}(k) \sum_i S_t^{(n)}(\mathbf{x}_i, k) \cdot K\!\left( \frac{\mathbf{x}_i - \mathbf{y}}{(1 + \Delta) \cdot \mathbf{h}^{(n)}(k-1)} \right) \cdot \delta\!\left[ b(\mathbf{x}_i, k) - u \right], \qquad (18) \]

where p(y) = {p_u(y)}_{u=1…U} and C_p^(n)(k) is the normalization constant. Here, the size of the candidate region is enlarged to (1 + Δ) · h^(n)(k−1) so that the direction of enlargement can be deduced when the face becomes larger. A typical value for Δ is 0.1. The saliency map for the face tracking, S_t^(n), ignores pixels that are likely to contain background features or nonskin color. The saliency map also assigns a low weight to the region where the nth face was located at the (k−1)th frame, to prevent convergence to the false positive errors included in S_p_f^(n), such as the one shown in Figure 2d. We can express the saliency map as follows:

\[ S_t^{(n)}(\mathbf{x}_i, k) = \begin{cases} S_c^{(Se(n,k))}(\mathbf{x}_i, k) + \alpha_t \cdot S_{p\_f}^{(n)}(\mathbf{x}_i, k) \cdot \left( 1 - S_c^{(Se(n,k))}(\mathbf{x}_i, k) \right) & \mathbf{x}_i \in ROI_t^{(n)}(k), \\ 0 & \text{otherwise,} \end{cases} \qquad (19) \]

where 0 < α_t < 1.
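The composition in Eq. (19) reduces to a weighted overlay of the two saliency maps inside ROI_t; a one-function sketch, with alpha_t assumed, is:

```python
import numpy as np

def tracking_saliency(s_c, s_p_f, roi_mask, alpha_t=0.5):
    """Sketch of Eq. (19): S_t = S_c + alpha_t * S_p_f * (1 - S_c)
    inside ROI_t, and 0 elsewhere. s_c and s_p_f are float maps in [0, 1]."""
    s_t = s_c + alpha_t * s_p_f * (1.0 - s_c)
    return np.where(roi_mask, s_t, 0.0)
```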


D.4. Update of the Target Model. After finding the target location y_f^(n)(k) of the nth face at the kth frame, we compared the similarities between the target model and the target candidates centered at y_f^(n)(k) with different axis lengths, namely h^(n)(k) = h^(n)(k−1), h^(n)(k) = (1 + Δ) · h^(n)(k−1), and h^(n)(k) = (1 − Δ) · h^(n)(k−1), to adapt the size of the nth face. The nth target model is then updated solely by extracting the pixels on the foreground region of S_t^(n), denoted S_t_b^(n), within the localized target region centered at y_f^(n)(k). This extraction can be expressed as

\[ q_u^{(n)}(k) \leftarrow C_q^{(n)}(k) \sum_i S_{t\_b}^{(n)}(\mathbf{x}_i, k) \cdot K\!\left( \frac{\mathbf{x}_i - \mathbf{y}_f^{(n)}(k)}{\mathbf{h}^{(n)}(k)} \right) \cdot \delta\!\left[ b(\mathbf{x}_i, k) - u \right]. \qquad (20) \]
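For the similarity test used in the localization and update phases, a sketch of the Bhattacharyya coefficient and the three-scale comparison follows; the construction of the candidate histograms is assumed to follow the Eq. (18)/(20) pattern above.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized U-bin histograms."""
    return float(np.sum(np.sqrt(p * q)))

def adapt_scale(q_model, candidates):
    """Sketch of the D.4 scale test: among candidate histograms built at
    axis lengths h, (1+Delta)h, and (1-Delta)h, keep the most similar one.

    candidates -- list of (scale_factor, histogram) pairs
    """
    scores = [(bhattacharyya(p, q_model), s) for s, p in candidates]
    best_score, best_scale = max(scores)
    return best_scale, best_score
```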

D.5. Detection of a Tracking Failure. For verification of a tracking failure, the nth tracker is eliminated when it is motionless or when the skin color ratio within the tracked region is low for several consecutive frames. In the case of an overlap between two trackers, the smaller of the two trackers is immediately eliminated.
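The failure tests of this phase amount to simple per-tracker counters; a sketch with assumed thresholds:

```python
def check_tracking_failure(tracker, skin_ratio, moved,
                           min_skin=0.3, max_bad=10):
    """Sketch of D.5: eliminate a tracker that is motionless or has a low
    skin-color ratio for several consecutive frames (thresholds assumed)."""
    if skin_ratio < min_skin or not moved:
        tracker.bad_frames = getattr(tracker, "bad_frames", 0) + 1
    else:
        tracker.bad_frames = 0
    return tracker.bad_frames >= max_bad  # True => eliminate this tracker
```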

Figure 3. A comparison of the automatic face detection/tracking results. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

Table IV. Experimental results of the automatic face detection/tracking process.

  Method                                                  Detection Rate (%)  No. of False Positives  Processing Time (ms/frame)
  Kernel-based mean shift without an update phase
    (Comaniciu et al., 2003)                              41.0                6,273                   17.0
  Kernel-based mean shift with an update phase            43.4                5,413                   17.7
  Three-stage model without a prerecognition phase
    (without the prediction phase)                        77.6                2,405                   17.5
  Three-stage model without a prerecognition phase
    (with the prediction phase using an AR model;
    Yang et al., 2005)                                    88.6                1,423                   17.6
  Three-stage model                                       93.4                913                     17.7


IV. EXPERIMENTAL RESULTS

To evaluate the performance of our proposed model, we applied it to 100 image sequences with a frame resolution of 320 × 240 pixels (8453 labeled faces and 6208 frames), which contained various situations, backgrounds, and illumination conditions. The application was executed on a Pentium IV 3.2 GHz PC.

We carried out the following comparisons for the face tracking process: a kernel-based mean shift tracking method (Comaniciu et al., 2003), which is a representative deterministic method and does not update the target model; a kernel-based mean shift tracking method with an update phase that updates the target model at each frame; the three-stage model without a prerecognition phase, which begins searching for the face at the previous face position in the localization phase; the three-stage model without a prerecognition phase, which begins searching for the face at a position predicted by an AR model in the localization phase (Yang et al., 2005); and the full three-stage model. Table IV and Figure 3 show the results of our comparison.

Even though the kernel-based mean shift methods work well against backgrounds where the faces are distinctive, they often fail to track in a skin-colored background, as shown in Figure 3b. Figure 3a shows a good example of the difference between the two mean shift methods. The mean shift method without an update phase works well in an environment where the color distribution of the target does not change, but it is sensitive to changes in illumination. On the other hand, the mean shift method with an update phase of the target model often fails to track because of an inaccurate target model in which background information accumulates at each frame. Note also, as shown in Figure 3b, that the three-stage model without a prerecognition phase cannot follow a face in an environment where a person wears a skin-colored shirt and walks around in front of a skin-colored background.

V. CONCLUDING REMARKS

After giving consideration to the two critical factors, we have proposed the three-stage model for robust tracking in real time despite interruptions from objects with face-like features.

The preattentive and assignment stages protect the target model and the target candidate region from including image regions that do not belong to the face region. The prerecognition phase of the postattentive stage enables the tracker to estimate face movements accurately despite interruption from false positive blobs.

We have demonstrated that the proposed three-stage model prevents convergence to a local maximum while maintaining a low computational cost. Moreover, the performance of the model is superior to that of other well-known methods.

REFERENCES

D. Comaniciu, V. Ramesh, and P. Meer, Kernel-based object tracking, IEEE Trans Pattern Analysis Machine Intelligence 25 (2003), 564–577.

M.D. Cordea, E.M. Petriu, N.D. Georganas, D.C. Petriu, and T.E. Whalen, Real-time 2(1/2)-D head pose recovery for model-based video-coding, IEEE Trans Instrum Measurement 50 (2001), 1007–1013.

M. Jones and J. Rehg, Statistical color models with application to skin detection, Compaq Cambridge Res Lab Tech Rep CRL 98/11, 1998.

U. Neisser, Cognitive psychology, Appleton-Century-Crofts, New York, 1967.

S.L. Phung, A. Bouzerdoum, and D. Chai, Skin segmentation using color pixel classification: Analysis and comparison, IEEE Trans Pattern Analysis Machine Intelligence 27 (2005), 148–154.

C. Shan, Y. Wei, T. Tan, and F. Ojardias, Real time hand tracking by combining particle filtering and mean shift, Proc IEEE Conf Automatic Face and Gesture Recognition, 2004, pp. 669–674.

K. Toyama and G.D. Hager, Incremental focus of attention for robust vision-based tracking, Int J Comput Vis 35 (1999), 45–63.

J. Triesch and C. von der Malsburg, Democratic integration: Self-organized integration of adaptive cues, Neural Comput 13 (2001), 2049–2074.

Y. Wu and T.S. Huang, Robust visual tracking by integrating multiple cues based on co-inference learning, Int J Comput Vis 58 (2004), 55–71.

C. Yang, R. Duraiswami, and L. Davis, Fast multiple object tracking via a hierarchical particle filter, Proc IEEE Conf Computer Vision, 2005, pp. 212–219.

M.-H. Yang, D.J. Kriegman, and N. Ahuja, Detecting faces in images: A survey, IEEE Trans Pattern Analysis Machine Intelligence 24 (2002), 34–58.

Z. Yin and R. Collins, Spatial divide and conquer with motion cues for tracking through clutter, Proc IEEE Conf Computer Vision Pattern Recognition, 2006, pp. 570–577.
