Three-Stage Model for Robust Real-Time Face Tracking
Jun-Hyeong Do, Zeungnam Bien
Division of Electrical Engineering, School of Electrical Engineering and Computer Science, KAIST, 373-1 Guseong-Dong, Yuseong-Gu, Daejeon, Korea
Received 19 May 2007; accepted 16 January 2008
ABSTRACT: We propose a novel face tracking framework, the three-stage model, for robust face tracking against interruptions from face-like blobs. For robust face tracking in real-time, we considered two critical factors in the construction of the proposed model. One factor is the exclusion of background information in the initialization of the target model, the extraction of the target candidate region, and the updating of the target model. The other factor is the robust estimation of face movement under various environmental conditions. The proposed three-stage model consists of a preattentive stage, an assignment stage, and a postattentive stage with a prerecognition phase. The model is constructed by means of effective integration of optimum cues that are selected in consideration of the trade-off between true positives and false positives of face classification based on a context-dependent type of categorization. The experimental results demonstrate that the proposed tracking method improves the performance of the real-time face tracking process in terms of success rates and with robustness against interruptions from face-like blobs. © 2008 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 17, 321–327, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ima.20126
Key words: face tracking; multi-stage model; multiple cues;
prerecognition phase; candidate blob
I. INTRODUCTION
In the process of human face tracking for a given video sequence, a
face tracker often fails to track targets under various environmental
conditions such as varying illumination and size of the tracked face,
skin-colored background, abrupt motion of the face, and occlusion
by other objects. The type of failure in which a tracker tracks face-
like objects instead of a face is induced by the fact that the tracker
lacks the capability of accurately distinguishing between faces and
face-like objects.
A typical real-time face tracking process contains three phases:
an initialization phase, in which a target model is initialized when a
face is detected; a localization phase, where the tracker finds the
most similar target candidate region in comparison with the target
model; and the update phase, in which the target model is updated
in response to varying environmental conditions. To implement a
robust face tracker in various environmental conditions and real-
time, we considered two critical factors of the tracking process.
The first critical factor is that a technique is required for exclud-
ing as much background information as possible from the target
model and target candidate region, even if face-like objects are in
the background. The target model and the target candidate region
that contain the background information recursively produce inac-
curate target localization and inaccurate updates of the target
model, thereby causing tracking failure. In particular, when the dis-
tributions of face features and background features are similar, the
tracker may easily converge to a face-like object in the background.
For example, color information, which is easy to implement and
insensitive to the variation of pose, expression, scale, and rotation,
is used in most face tracking processes and yields a high recognition
rate; however, if the tracker relies solely on a color cue, the performance can be affected by other skin-tone objects. To complement
this defect, several researchers have used multiple cues. Triesch and von der Malsburg (2001) integrated five different visual cues to detect and
track a single face by calculating the weighted sum of each cue’s
observational probability. On the other hand, Cordea et al. (2001)
and Wu and Huang (2004) combined the color cue and the face
contour on the basis of edge detection to increase the robustness
and generality, though this method is time-consuming and often
fails to track the face in a skin-colored complex background due to
the detection error of the face contour. In another approach, Coma-
niciu et al. (2003) derived a simple representation of the back-
ground features and used it to neglect the background features in
the target model and in the target candidate region extracted for the
target localization phase. Yin and Collins (2006) also proposed a spa-
tial divide-and-conquer method to discriminate the target from the
background. These techniques are effective when the background is
simple but their performance is poor whenever some objects or ele-
ments of the background have similar features to the target.
The second critical factor is the robust estimation of face move-
ments. Even though a technique that discriminates between faces
and nonfaces is applied to the face tracking process, the region clas-
sified as faces may contain some false positives, such as skin-col-
ored moving objects and moving hands. Hence, without accurate
estimation of face movements, the tracker may easily converge to
a local maximum that corresponds to a false positive. Most real-
time tracking processes predict face movements by using a
[Correspondence to: Jun-Hyeong Do; e-mail: [email protected]. Contract grant sponsor: MOST/KOSEF (SRC/ERC Program); Grant number: R11-1999-008. Contract grant sponsor: The School of Information Technology, KAIST (Brain Korea 21 Project). © 2008 Wiley Periodicals, Inc.]
autoregressive (AR) model that renders an approximate model of
face motion on the basis of previous information about face move-
ments, such as the position, velocity, and acceleration (Comaniciu
et al., 2003; Shan et al., 2004; Yang et al., 2005). However, this
type of motion model-based estimation has limitations with regard
to accurate estimations of face motions: that is, the face motion in
natural situations is likely to be nonlinear and involve sudden accel-
erations, reversal of rotations, and changes in orientation. More-
over, the corresponding image plane projection causes the modeling
of face motions to become more complex. Because the target move-
ment makes prediction difficult, such conventional prediction tech-
niques easily lose the target and often converge to a local maximum
in the image where face-like objects have merged with the face.
In considering the two factors for robust face tracking, we pro-
pose a novel framework that effectively integrates multicues. In
Section II, we discuss how to select and obtain optimum cues that maximize the true positives while minimizing the false positives for face classification. In Section III, we describe in detail the
three-stage model. The experimental results, which are presented in
Section IV, show the superior performance of the proposed model.
Finally, we offer our conclusions in Section V.
II. MULTIPLE CUES FOR FACE TRACKING
A. Selection of Cues. To select the optimum cues for effective
face tracking, we analyzed which types of false negatives, true posi-
tives, and false positives could be obtained when a face classifier
that used each cue was applied to a given image.
For the tracking of live human faces, we assumed that a face
could be well characterized by the three salient features of motion,
skin color, and face-like appearance, which are known to be mutu-
ally independent and complementary (Yang et al., 2002). On the ba-
sis of this assumption, we regarded a given input image as a set of
blobs that consists of faces (Fa), hands (Ha), moving objects (MO),
and background blobs (Bg). Each type of blob is classified into sub-
blobs with more specific properties, as depicted in Table I. Basi-
cally, we considered the hands, moving objects, and background
blobs to have skin color, motion, and motionless properties,
respectively.
Table II shows the classification results of faces when the binary
classifiers that use each cue to characterize a face are applied to a
given image. The face classification in which the classifier uses
each cue yields different false negatives, true positives, and false
positives. It goes without saying that both the largest number of
true positives and the smallest number of false positives provide the
best results, but in practice there is a trade-off between true posi-
tives and false positives, even though multiple cues are integrated.
As a result, the optimum cues for face tracking are difficult to
determine.
To prevent some of the face regions from being filtered out, we initially selected cues that had a maximum number of true positives even if some false positives occurred, as depicted in Table III. With these selected cues, there may be false positives such as Ha(x,1), MO(x,1), and Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)). With regard to these false positives, we describe in Section III.D a novel method that prevents the tracker from converging to such false positive blobs.
B. Effective Extraction of Selected Cues. Let the saliency
map of a given cue, Scue(xi, k), be a two-dimensional map that
expresses the saliency of the cue at location xi of the kth frame,
where 0 ≤ S_cue(x_i, k) ≤ 1. As the basis of the proposed model, the
saliency map of each cue is obtained with respect to robustness
under various environmental conditions.
B.1. Skin Color. The saliency map of the skin color, S_s(x_i, k), is obtained by means of a generalized skin color model. For more accurate modeling of the skin color distribution, histogram models
Table I. Blob constitution in an image.

  Image: I_input = {Fa(a,m), Ha(a,m), MO(a,c), Bg(a,c)}

  Blobs                      | Sub-blobs
  Faces: Fa(a,m)             | {Fa(0,0), Fa(0,1), Fa(1,0), Fa(1,1), Fa(2,0), Fa(2,1)}
  Hands: Ha(a,m)             | {Ha(0,0), Ha(0,1), Ha(1,0), Ha(1,1), Ha(2,0), Ha(2,1)}
  Moving objects: MO(a,c)    | {MO(0,0), MO(0,1), MO(1,0), MO(1,1), MO(2,0), MO(2,1)}
  Background blobs: Bg(a,c)  | {Bg(0,0), Bg(0,1), Bg(1,0), Bg(1,1), Bg(2,0), Bg(2,1)}

  Property of blobs:
  a (appearance): 0 = frontal face-like appearance; 1 = profile-like appearance; 2 = others; x = do not care
  m (motion):     0 = motionless; 1 = in motion; x = do not care
  c (color):      0 = nonskin; 1 = skin; x = do not care
Table II. Face classification using individual cues.

  Cue Used                          | FN (classified as nonface) | TP (classified as face) | FP (classified as face)
  Appearance (frontal)              | Fa(1,x), Fa(2,x)           | Fa(0,x)                 | Ha(0,x), MO(0,x), Bg(0,x)
  Appearance (frontal and profile)  | Fa(2,x)                    | Fa(0,x), Fa(1,x)        | Ha(0,x), MO(0,x), Bg(0,x), Ha(1,x), MO(1,x), Bg(1,x)
  Skin color                        | 0                          | Fa(x,x)                 | Ha(x,x), MO(x,1), Bg(x,1)
  Motion                            | Fa(x,0)                    | Fa(x,1)                 | Ha(x,1), MO(x,x)
  Combination of cues               | {Fa(x,x)} − TP             | TP_cue1 ∩ TP_cue2 ∩ …   | FP_cue1 ∩ FP_cue2 ∩ …
are used instead of a single Gaussian model or a Gaussian mixture
model, which is known to be ineffective in modeling complex-
shaped skin color distributions (Phung et al., 2005). To segment the
general skin color blobs regardless of an individual’s skin color, we
constructed two-dimensional skin and nonskin color histogram
models from the huge databases of skin and nonskin color images,
respectively, gathered by Jones and Rehg (1998). Each model had a
size of 256 bins per color channel and was constructed in the UV
color space to be robust against illumination changes.
With the two color histograms, we can compute the probability
of a skin color for a given uv value on the basis of the Bayes rule as
follows:
$$P(s \mid uv) = \frac{P(uv \mid s)\,P(s)}{P(uv \mid s)\,P(s) + P(uv \mid \neg s)\,P(\neg s)} = \frac{s[uv]}{s[uv] + n[uv]}, \qquad (1)$$

where s[uv] and n[uv] are the numbers of pixels that correspond to the bin uv of the skin and nonskin color histograms, respectively. Then, from Eq. (1), we can calculate S_s(x_i, k) as follows:

$$S_s(x_i, k) = \begin{cases} 1 & \text{if } P(s \mid uv(x_i, k)) > \alpha \text{ and } s[uv(x_i, k)] > \beta \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Here, α and β are threshold values.
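As an illustration, the Bayes-rule skin test of Eqs. (1) and (2) can be sketched in a few lines of Python. This is a toy sketch, not the authors' implementation: the histograms are small dictionaries of bin counts rather than the Jones and Rehg database, and the function name and threshold values are illustrative.

```python
# Sketch of the histogram-based skin probability of Eqs. (1)-(2).
# s_hist and n_hist are hypothetical 2-D histograms of skin / nonskin
# pixel counts, indexed by quantized (u, v) chroma values.

def skin_saliency(u, v, s_hist, n_hist, alpha=0.4, beta=10):
    """Return 1 if the (u, v) color is classified as skin, else 0."""
    s = s_hist.get((u, v), 0)   # skin count for this bin
    n = n_hist.get((u, v), 0)   # nonskin count for this bin
    if s + n == 0:
        return 0                # unseen color: treat as nonskin
    p_skin = s / (s + n)        # Bayes rule reduces to s/(s+n), Eq. (1)
    # Eq. (2): require both a high posterior and enough skin evidence
    return 1 if (p_skin > alpha and s > beta) else 0

# toy usage with made-up counts
s_hist = {(100, 140): 80}
n_hist = {(100, 140): 20}
print(skin_saliency(100, 140, s_hist, n_hist))  # prints 1: 0.8 > 0.4 and 80 > 10
```

The second condition of Eq. (2) (s[uv] > β) rejects bins for which the posterior is high only because very few training pixels fell into them.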
B.2. Motion Blobs. A set of complete motion blobs is obtained by applying a closing operation to the motion pixels calculated from the frame difference, which allows the motion blobs to be acquired without initializing the scene model. The saliency maps of the motion pixel and motion blob are sequentially computed as follows:

$$S_{m\_p}(x_i, k) = \begin{cases} 1 & \text{if } \left| I_{in\_i}(x_i, k) - I_{in\_i}(x_i, k-1) \right| > \gamma \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

$$S_{m\_b}(x_i, k) = \mathrm{Closing}\left( S_{m\_p}(x_i, k) \right), \qquad (4)$$

where I_in_i(k) is the intensity component of the input image, I_in(k), at the kth frame and γ is a threshold value.
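A minimal sketch of Eqs. (3) and (4), assuming frames are plain nested lists of intensities; the closing is implemented naively as a 3×3 dilation followed by a 3×3 erosion, and the function names are illustrative, not the authors' code.

```python
# Eq. (3): per-pixel motion saliency from the frame difference.
def motion_pixels(prev, curr, gamma=15):
    """Binary motion map: 1 where intensity changed by more than gamma."""
    h, w = len(curr), len(curr[0])
    return [[1 if abs(curr[y][x] - prev[y][x]) > gamma else 0
             for x in range(w)] for y in range(h)]

def _morph(img, op):
    """Apply op (max = dilate, min = erode) over a 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [img[yy][xx]
                     for yy in range(max(0, y - 1), min(h, y + 2))
                     for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = op(neigh)
    return out

# Eq. (4): closing = dilation followed by erosion; it bridges small
# gaps so that fragmented motion pixels become complete motion blobs.
def closing(img):
    return _morph(_morph(img, max), min)
```

A real-time implementation would use an optimized morphology routine (e.g. from an image-processing library) rather than this O(9·w·h) loop, but the behavior is the same: narrow holes between motion pixels are filled.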
B.3. Previous Face Location. The face blob under tracking is represented by an elliptical region with an aspect ratio of 1.4, where the minor axis, h_x, and the major axis, h_y, are equal to 0.86h and 1.2h, respectively, and h × h is the area of the detected face.

The saliency map of the previous location of the nth face blob at the kth frame, S_p_f^(n)(x_i, k), is defined as

$$S_{p\_f}^{(n)}(x_i, k) = \begin{cases} 1 & \text{if } S_s(x_i, k) \cdot K\!\left( \dfrac{x_i - y_f^{(n)}(k-1)}{h^{(n)}(k-1)} \right) > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$

where h^(n) = [h_x^(n), h_y^(n)]^T; y_f^(n)(k) is the center of gravity (COG) of the nth face at the kth frame; K is the Epanechnikov kernel; n = 0, …, N; and N is the number of faces being tracked.
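The elliptical mask of Eq. (5) reduces to testing whether a pixel is skin and falls inside the kernel support. A sketch under the same assumptions (illustrative names; the kernel profile is the standard Epanechnikov form up to its normalizing constant, which is irrelevant here because only the support matters):

```python
def epanechnikov(tx, ty):
    """2-D Epanechnikov kernel profile: positive inside the unit
    circle of the normalized coordinates, zero outside."""
    r2 = tx * tx + ty * ty
    return max(0.0, 1.0 - r2)

def prev_face_saliency(x, y, skin, cog, hx, hy):
    """Eq. (5): 1 where the pixel is skin AND lies inside the ellipse
    centered at the previous face COG with axes (hx, hy)."""
    k = epanechnikov((x - cog[0]) / hx, (y - cog[1]) / hy)
    return 1 if skin and k > 0 else 0
```

Since K vanishes outside the unit circle, the product S_s · K is nonzero exactly for skin pixels inside the previous face ellipse, which is what Eq. (5) encodes.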
III. THREE-STAGE MODEL
A. Overall Configuration. According to studies by cognitive scientists, the mechanism of the focus of attention in biological vision systems broadly comprises preattentive and postattentive stages (Neisser, 1967; Toyama and Hager, 1999). The preattentive stage rapidly finds regions of interest (ROIs) and excludes most non-ROIs; attention is then focused on the ROIs in the postattentive stage, where the attentive region is examined more closely. Inspired by this notion of attention, as shown in Figure 1, we propose a three-stage model that consists of a preattentive stage, an assignment stage, and a postattentive stage.
In the preattentive stage, face candidate blobs that contain both
skin color and motion blobs are extracted, while most of the non-
face regions are excluded with low computational cost. The assignment stage assigns to each tracked face the candidate blob with the largest overlap region on that face. The assignor serves to prevent interference from other candidate blobs that are not related to the tracking of each face.
The tracking process for each face is then conducted on the
assigned region and on the previous location of the tracked face by
using the information about the face candidate blobs from the cur-
rent and previous frames as well as using the face color model.
B. Preattentive Stage. Among the selected cues for the face
tracking, the face candidate blobs that contain both skin color and
motion blobs are extracted in the preattentive stage. To acquire the
candidate blobs with low computational cost, we considered the pri-
ority of operations and the ROIs where the operations were to be
applied.
By comparing the processing time required to obtain S_m_p, S_m_b, and S_s in the entire region, we found that the computational cost of obtaining S_m_b is higher than that of S_s or S_m_p due to the morphological operation conducted to obtain the motion blobs. At first, therefore, we obtained the two saliency maps S_s and S_m_p, and the processing regions for obtaining the motion blobs were reduced as follows:
Table III. Combination of cues for face tracking at the kth frame.

  Cue Used: (skin color and motion) ∪ (face location at (k−1)th frame and skin color)
  FN: 0
  TP: Fa(x,x)
  FP: Ha(x,1), MO(x,1), Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1))
Figure 1. The overall configuration of the proposed structure.
Figure 2. False positive errors in the face tracking process. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]
$$\mathrm{ROI}_1 \leftarrow \mathrm{ROI}(S_s \cap S_{m\_p}) \qquad (6)$$

$$\mathrm{ROI}_2 \leftarrow \bigcup_{r=1}^{R} \left( \mathrm{ROI}(S_s^{(r)}) \text{ within } \mathrm{ROI}_1 \right), \qquad (7)$$

where ROI(image) is the rectangular box region surrounding the foreground pixels in the given image, and R is the number of blobs within the ROI_1 of S_s.
After applying ROI_2 to obtain S_m_b, we then acquired the saliency map of the face candidate blobs, S_c, and their ROIs as follows:

$$S_c = \mathrm{Filtering}(S_{m\_b} \cap S_s) \text{ within } \mathrm{ROI}_2 \qquad (8)$$

$$\mathrm{ROI}_c = \bigcup_{m=1}^{M} \mathrm{ROI}\left( S_c^{(m)} \right), \qquad (9)$$

where S_m_b = Closing(S_m_p within ROI_2) and M is the number of face candidate blobs. Note that small blobs are eliminated by a filtering process.
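The ROI(·) operator used throughout Eqs. (6)–(9) is simply the bounding box of the foreground pixels of a binary map; a sketch on nested lists (function name illustrative):

```python
def roi(mask):
    """Rectangular box (x_min, y_min, x_max, y_max) surrounding the
    foreground (nonzero) pixels of a binary map; None if empty."""
    pts = [(x, y) for y, row in enumerate(mask)
           for x, v in enumerate(row) if v]
    if not pts:
        return None
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs), max(ys))
```

Restricting the expensive closing operation of Eq. (4) to such boxes is what keeps the preattentive stage cheap: the morphology runs only inside ROI_2 rather than over the whole frame.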
C. Assignment Stage. Among the candidate blobs extracted in
the preattentive stage, one blob for the tracking of each face is
selected by comparing the overlap ratio between candidate blobs
and each face blob of the previous frame.
The selected candidate blob for the tracking of the nth face at
the kth frame is calculated as follows:
$$Se(n, k) = \begin{cases} \arg\max_m \, Ov\!\left( S_c^{(m)}(k), S_{p\_f}^{(n)}(k) \right) & \text{if } \max_m \, Ov\!\left( S_c^{(m)}(k), S_{p\_f}^{(n)}(k) \right) \neq 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (10)$$

where

$$Ov\!\left( S_c^{(m)}(k), S_{p\_f}^{(n)}(k) \right) = \frac{\mathrm{Area}\left( S_c^{(m)}(k) \cap S_{p\_f}^{(n)}(k) \right)}{\mathrm{Area}\left( S_{p\_f}^{(n)}(k) \right)} \qquad (11)$$

and Area(image) is the area of the foreground pixels in the given image.
After the candidate blobs have been assigned for face tracking,
the unselected candidate blobs are then assigned for face detection.
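Eqs. (10) and (11) amount to picking the candidate blob whose overlap ratio with the previous face mask is largest and nonzero. A sketch on binary masks stored as nested lists (names illustrative; the real system works on the saliency maps within their ROIs):

```python
def overlap(cand, prev_face):
    """Eq. (11): intersection area divided by the previous face area."""
    inter = sum(c & p for rc, rp in zip(cand, prev_face)
                for c, p in zip(rc, rp))
    area_prev = sum(sum(row) for row in prev_face)
    return inter / area_prev if area_prev else 0.0

def assign(cands, prev_face):
    """Eq. (10): index of the candidate blob with the largest nonzero
    overlap ratio, or None when no candidate touches the face."""
    scores = [overlap(c, prev_face) for c in cands]
    best = max(range(len(scores)), key=scores.__getitem__) if scores else None
    return best if best is not None and scores[best] > 0 else None
```

Normalizing by the face area rather than the candidate area makes the ratio insensitive to how large the candidate blob itself is, so a big merged blob that covers the face still wins the assignment.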
D. Postattentive Stage. According to the face classification results obtained with the optimum cues, as discussed in Section II.A, the face tracker can track all types of faces but, as shown in Figure 2, it may also follow the blobs that correspond to the false positives, such as Ha(x,1), MO(x,1), and Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)), during the tracking.

The false positives caused by Ha(x,1) and MO(x,1) (Figs. 2a–2c) occur because of the use of the skin color and motion cues, which are contained in the candidate blobs, S_c, during the preattentive stage. Meanwhile, the Fa(x,x)(k−1) ∩ (Ha(x,x) ∪ MO(x,1) ∪ Bg(x,1)) blobs (Fig. 2d) occur because of the use of a combination of the face location in the previous frame and the skin color. To prevent the face tracker from converging to these two types of false positive blobs, tracking schemes are proposed in the prerecognition and localization phases, respectively.
D.1. Initialization of Target Model. When the nth face is newly detected by a detection process, the target model for the nth face tracking, q^(n), is initialized by the color probability distribution function of the target through the use of U-bin histograms, given by

$$q^{(n)} = \left\{ q_u^{(n)} \right\}_{u = 1 \ldots U}. \qquad (12)$$
As a technique to prevent the target model from containing background information, we applied the saliency map of the mth candidate blob assigned for the nth face detection to pick out only the skin color and motion pixels in the detected face region. Let the function b(x_i) be the index of the histogram bin associated with the pixel value at x_i. The probability of the uth bin of the nth target histogram model at the kth frame is then computed as

$$q_u^{(n)}(k) = C_q^{(n)}(k) \sum_i S_c^{(m)}(x_i, k) \cdot K\!\left( \frac{x_i - y_f^{(n)}(k)}{h^{(n)}(k)} \right) \cdot \delta\left[ b(x_i, k) - u \right], \qquad (13)$$

where δ is the Kronecker delta function and C_q^{(n)}(k) is the normalization constant.
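Eq. (13) is a kernel-weighted color histogram in which non-salient pixels contribute nothing. A sketch with illustrative data structures (a dict mapping pixel coordinates to histogram bins, a dict for the candidate-blob saliency, and the Epanechnikov profile written inline):

```python
def target_model(pixels, sal, cog, hx, hy, bins=16):
    """Eq. (13): kernel- and saliency-weighted color histogram.
    pixels: {(x, y): bin_index} from b(x_i); sal: {(x, y): 0 or 1},
    the candidate-blob saliency S_c; (cog, hx, hy) define the kernel."""
    q = [0.0] * bins
    for (x, y), u in pixels.items():
        # Epanechnikov profile of the normalized distance to the COG
        k = max(0.0, 1.0 - ((x - cog[0]) / hx) ** 2
                          - ((y - cog[1]) / hy) ** 2)
        q[u] += sal.get((x, y), 0) * k   # background pixels add nothing
    total = sum(q)                        # C_q: normalize to sum to 1
    return [v / total for v in q] if total else q
```

Because S_c masks out pixels that are neither skin-colored nor moving, background colors inside the detection box never enter q^(n), which is the point of this initialization step.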
D.2. Prerecognition of Target Movement. In this phase, the
direction of face movement is estimated to prevent the tracker from
converging to a local maximum that corresponds to a moving hand
(Ha(x,1)) or to a skin-colored moving object (MO(x,1)) prior to the
face localization phase.
The preattentive stage allows us to anticipate the nth face information of the current frame from the candidate blob assigned for the nth face tracking before the tracking process is carried out in the localization phase. Thus, by using the face information of the current frame as well as the previous frame with low computational cost, the tracking process is capable of prerecognizing the nth face state of the current frame and estimating the face movement more accurately in real-time than AR model-based prediction methods (Comaniciu et al., 2003; Shan et al., 2004; Yang et al., 2005).
Note that the false positives contained in S_c^(Se(n)), the saliency map of the Se(n)th candidate blob assigned for the nth face tracking, make it difficult to estimate the face movement from S_c^(Se(n)) itself. However, by considering the variation of the COG of S_c^(Se(n)) between the previous frame and the current frame, we also note that a false positive blob may negatively affect the estimation of the nth face movement when it merges with the nth face blob (kth frame in Figs. 2a–2c) or when it demerges from the nth face ((k+1)th frame in Fig. 2a).
To minimize the negative effect due to variations of the skin-colored moving objects that are already merged with a face blob, as shown in Figures 2b and 2c, we initially specified the ROI for the nth face tracking at the kth frame, ROI_t^(n)(k), as a rectangular region that surrounds the nth face region of the (k−1)th frame with a small margin (α_h · h_x^(n)(k−1)), as follows:

$$\mathrm{ROI}_t^{(n)}(k) = \mathrm{ROI}\left( S_{e\_f}^{(n)}(k-1) \right), \qquad (14)$$

where

$$S_{e\_f}^{(n)}(x_i, k) = \begin{cases} 1 & \text{if } K\!\left( \dfrac{x_i - y_f^{(n)}(k)}{h^{(n)}(k) + \alpha_h \cdot h_x^{(n)}(k) \cdot [1\ 1]^T} \right) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

and 0 < α_h < 1.
As for the false positive that occurs when a skin-colored moving object merges with the face blob at the kth frame, as shown in the (k−1)th and kth frames of Figure 2a, we observed that the COG of S_c^(Se(n,k)) has moved toward the merged false positive blob compared with the COG of S_c^(Se(n,k−1)) of the (k−1)th frame. This movement may lead to a tracking failure in the sense that the tracker converges to the merged false positive blob in the localization phase. When the skin-colored moving object demerges from the face blob at the (k+1)th frame, as shown in the kth and (k+1)th frames of Figure 2a, on the other hand, the COG of S_c^(Se(n,k+1)) has moved in the opposite direction from the moving hand, which makes the tracker converge to the face in the localization phase. Therefore, the pixels that are turned into foreground in the candidate region by the merged false positive blobs at the current frame should be ignored in calculating the movement of the COG of S_c^(Se(n)).
On the basis of the above properties, we can conclude that the movement direction of the nth face at the kth frame can be prerecognized accurately as follows if we ignore the newly appearing candidate region and consider the disappeared and remaining candidate regions in S_c^(Se(n,k))(k) within ROI_t^(n)(k) compared with S_c^(Se(n,k−1))(k−1) within ROI_t^(n)(k):

$$Mo(n, k) = \alpha_m \left[ \mathrm{COG}\left( S_c^{(Se(n,k))}(k) \cap S_c^{(Se(n,k-1))}(k-1) \text{ within } \mathrm{ROI}_t^{(n)}(k) \right) - \mathrm{COG}\left( S_c^{(Se(n,k-1))}(k-1) \text{ within } \mathrm{ROI}_t^{(n)}(k) \right) \right], \qquad (16)$$

where α_m is a positive constant. Furthermore, if S_c^(Se(n,k−1))(k−1) = 0 or S_c^(Se(n,k))(k) = 0, then Mo(n, k) = 0.

The search window of the face localization phase then begins searching for the target at the region centered at

$$y_s^{(n)}(k) = y_f^{(n)}(k-1) + Mo(n, k). \qquad (17)$$
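The prerecognition rule of Eq. (16), namely ignore newly appearing pixels and compare the COG of the surviving candidate pixels with the COG of the previous candidate, can be sketched on sets of foreground coordinates (α_m = 1 here; names illustrative):

```python
def cog(pts):
    """Center of gravity of a set of (x, y) foreground coordinates."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def prerecognized_shift(curr_pts, prev_pts, alpha_m=1.0):
    """Eq. (16): intersecting with the previous candidate discards
    pixels that newly appeared (e.g. a hand merging with the face),
    so only disappeared/remaining pixels drive the COG movement."""
    inter = curr_pts & prev_pts
    if not inter or not prev_pts:
        return (0.0, 0.0)   # degenerate case of Eq. (16): Mo = 0
    cx1, cy1 = cog(inter)
    cx0, cy0 = cog(prev_pts)
    return (alpha_m * (cx1 - cx0), alpha_m * (cy1 - cy0))
```

In the merging scenario of Figure 2a, the pixels contributed by the incoming hand are exactly the "newly appearing" ones; intersecting with the previous frame removes them, so the estimated shift follows the face rather than the hand.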
D.3. Localization of the Target. In the localization phase, the target candidate region is compared with the target model by using a similarity function with the Bhattacharyya coefficient, and the candidate region that maximizes the similarity value is found by a mean shift procedure (Comaniciu et al., 2003).

The candidate region, like the target model, is characterized by the color probability distribution function of a U-bin histogram. The probability of the uth bin in the color histogram of the candidate region, centered at y, for the nth face tracking at the kth frame is calculated as follows:

$$p_u^{(n)}(y, k) = C_p^{(n)}(k) \sum_i S_t^{(n)}(x_i, k) \cdot K\!\left( \frac{x_i - y}{(1 + \Delta) \cdot h^{(n)}(k-1)} \right) \cdot \delta\left[ b(x_i, k) - u \right], \qquad (18)$$

where p(y) = {p_u(y)}_{u=1…U} and C_p^(n)(k) is the normalization constant. Here, the size of the candidate region is enlarged to (1 + Δ) · h^(n)(k−1) so that, if the face grows, the direction of its growth can be deduced. A typical value for Δ is 0.1. The saliency map for the face tracking, S_t^(n), ignores the pixels that are likely to contain background features or nonskin color. The saliency map also assigns a low weight to the region where the nth face was located at the (k−1)th frame to prevent convergence to the false positive errors included in S_p_f^(n), such as shown in Figure 2d. We can express the saliency map as follows:

$$S_t^{(n)}(x_i, k) = \begin{cases} S_c^{(Se(n,k))}(x_i, k) + \alpha_t \cdot S_{p\_f}^{(n)}(x_i, k) \cdot \left( 1 - S_c^{(Se(n,k))}(x_i, k) \right) & x_i \in \mathrm{ROI}_t^{(n)}(k) \\ 0 & \text{otherwise,} \end{cases} \qquad (19)$$

where 0 < α_t < 1.
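The similarity function used in this phase is the Bhattacharyya coefficient between the candidate histogram p and the target model q; a sketch:

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized U-bin
    histograms: 1.0 for identical distributions, 0.0 for disjoint
    supports. Mean shift iteratively moves the candidate center y
    toward the position that maximizes this value."""
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))
```

Because both histograms sum to one (via C_p and C_q), the coefficient is bounded in [0, 1] by the Cauchy-Schwarz inequality, which makes it a convenient convergence criterion for the mean shift iterations.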
D.4. Update of the Target Model. After finding the target location, y_f^(n)(k), of the nth face at the kth frame, we compared the similarities between the target model and the target candidates centered at y_f^(n)(k) with different axis lengths, namely h^(n)(k) = h^(n)(k−1), h^(n)(k) = (1 + Δ) · h^(n)(k−1), and h^(n)(k) = (1 − Δ) · h^(n)(k−1), to adapt the size of the nth face. The nth target model is then updated solely by extracting the pixels on the foreground region of S_t^(n), denoted S_t_b^(n), within the localized target region centered at y_f^(n)(k). This extraction can be expressed as

$$q_u^{(n)}(k) \leftarrow C_q^{(n)}(k) \sum_i S_{t\_b}^{(n)}(x_i, k) \cdot K\!\left( \frac{x_i - y_f^{(n)}(k)}{h^{(n)}(k)} \right) \cdot \delta\left[ b(x_i, k) - u \right]. \qquad (20)$$
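The scale-adaptation step above, comparing the similarity at the previous size and at sizes scaled by (1 ± Δ), can be sketched as follows (Δ = 0.1; the similarity is passed in as a callable, e.g. the Bhattacharyya coefficient evaluated at the localized position; names illustrative):

```python
def adapt_scale(h_prev, similarity, delta=0.1):
    """Try the previous axis length and its (1 +/- delta)-scaled
    variants; keep whichever candidate region matches the target
    model best. similarity(h) -> float, higher is better."""
    candidates = [h_prev, (1 + delta) * h_prev, (1 - delta) * h_prev]
    return max(candidates, key=similarity)
```

Limiting the search to three discrete scales per frame keeps the update cheap while still letting the ellipse grow or shrink by about 10% per frame, which is enough to follow a face approaching or receding from the camera.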
D.5. Detection of a Tracking Failure. To detect a tracking failure, the nth tracker is eliminated when it is motionless or when the skin color ratio within the tracked region is low for several consecutive frames. In the case of an overlap between two trackers, the smaller of the two is immediately eliminated.
Figure 3. A comparison of the automatic face detection/tracking results. [Color figure can be viewed in the online issue, which is available at
www.interscience.wiley.com.]
Table IV. Experimental results of the automatic face detection/tracking process.

  Method                                                                         | Detection Rate (%) | No. of False Positives | Processing Time (ms/f)
  Kernel-based mean shift without an update phase (Comaniciu et al., 2003)       | 41.0               | 6,273                  | 17.0
  Kernel-based mean shift with an update phase                                   | 43.4               | 5,413                  | 17.7
  Three-stage model without a prerecognition phase (without the prediction phase)| 77.6               | 2,405                  | 17.5
  Three-stage model without a prerecognition phase (with the prediction phase using an AR model; Yang et al., 2005) | 88.6 | 1,423 | 17.6
  Three-stage model                                                              | 93.4               | 913                    | 17.7
IV. EXPERIMENTAL RESULTS
To evaluate the performance of our proposed model, we applied it to 100 image sequences with a frame resolution of 320 × 240 pixels (8453 labeled faces and 6208 frames), which contained various situations, backgrounds, and illumination conditions. The application was executed on a P-IV 3.2 GHz PC.
We carried out the following comparisons for the face tracking
process: a kernel-based mean shift tracking method (Comaniciu
et al., 2003), which is a representative deterministic method and
does not update the target model; a kernel-based mean shift tracking
method with an update phase that updates the target model at each
frame; the three-stage model without a prerecognition phase that
begins searching for the face at the previous face position in the
localization phase; the three-stage model without a prerecognition
phase that begins searching for the face at a position predicted by
an AR model in the localization phase (Yang et al., 2005); and the
three-stage model. Table IV and Figure 3 show the results of our
comparison.
Even though the kernel-based mean shift methods work well in
the background where the faces are distinctive, they often fail to
track in a skin-colored background, as shown in Figure 3b. Figure 3a
shows a good example that represents the difference between the
two mean shift methods. A mean shift method without an update
phase works well in an environment where the color distribution of
the target does not change, but this method is sensitive to changes
in illumination. On the other hand, the mean shift method with an
update phase of the target model often fails to track due to an inac-
curate target model where the background information accumulates
at each frame. Note also, as shown in Figure 3b, that the three-stage
model without a prerecognition phase cannot follow a face in an
environment where a person wears a skin-colored shirt and is walk-
ing around in a skin-colored background.
V. CONCLUDING REMARKS
After giving consideration to two critical factors, we have proposed
the three-stage model for robust tracking in real-time despite the
interruption of objects with face-like features.
The preattentive and assignment stages protect the target model
and the target candidate region from including image regions that
do not belong to the face region. The prerecognition phase of the
postattentive stage makes the tracker estimate face movements
accurately despite interruption from false positive blobs.
We have demonstrated that the proposed three-stage model pre-
vents convergence to a local maximum, while maintaining a low
computational cost. Moreover, the performance of the model is
superior to that of other well-known methods.
REFERENCES
D. Comaniciu, V. Ramesh, and P. Meer, Kernel-based object tracking, IEEE
Trans Pattern Analysis Machine Intelligence 25 (2003), 564–577.
M.D. Cordea, E.M. Petriu, N.D. Georganas, D.C. Petriu, and T.E. Whalen,
Real-time 2(1/2)-D head pose recovery for model-based video-coding, IEEE
Trans Instrum Measurement 50 (2001), 1007–1013.
M. Jones and J. Rehg, Statistical color models with application to skin detec-
tion, Compaq Cambridge Res Lab Tech Rep CRL 98/11, 1998.
U. Neisser, Cognitive psychology, Appleton-Century-Crofts, New York,
1967.
S.L. Phung, A. Bouzerdoum, and D. Chai, Skin segmentation using color
pixel classification: Analysis and comparison, IEEE Trans Pattern Analysis
Machine Intelligence 27 (2005), 148–154.
C. Shan, Y. Wei, T. Tan, and F. Ojardias, Real-time hand tracking by combining particle filtering and mean shift, Proc IEEE Conf Automatic Face and Gesture Recognition, 2004, pp. 669–674.
K. Toyama and G.D. Hager, Incremental focus of attention for robust
vision-based tracking, Int J Comput Vis 35 (1999), 45–63.
J. Triesch and C. von der Malsburg, Democratic integration: Self-organized integration of adaptive cues, Neural Comput 13 (2001), 2049–2074.
Y. Wu and T.S. Huang, Robust visual tracking by integrating multiple cues
based on co-inference learning, Int J Comput Vis 58 (2004), 55–71.
C. Yang, R. Duraiswami, and L. Davis, Fast multiple object tracking via a hierarchical particle filter, Proc IEEE Int Conf Computer Vision, 2005, pp. 212–219.
M.-H. Yang, D.J. Kriegman, and N. Ahuja, Detecting faces in images: A
survey, IEEE Trans Pattern Analysis Machine Intelligence 24 (2002), 34–58.
Z. Yin and R. Collins, Spatial divide and conquer with motion cues for
tracking through clutter, Proc IEEE Conf Computer Vision Pattern Recogni-
tion, 2006, pp. 570–577.