

Detecting Questionable Observers Using Face Track Clustering

Jeremiah R. Barr, Kevin W. Bowyer and Patrick J. Flynn
Department of Computer Science and Engineering

University of Notre Dame
Email: {jbarr1,kwb,flynn}@nd.edu

Abstract

We introduce the questionable observer detection problem: Given a collection of videos of crowds, determine which individuals appear unusually often across the set of videos. The algorithm proposed here detects these individuals by clustering sequences of face images. To provide robustness to sensor noise, facial expression and resolution variations, blur, and intermittent occlusions, we merge similar face image sequences from the same video and discard outlying face patterns prior to clustering. We present experiments on a challenging video dataset. The results show that the proposed method can surpass the performance of a clustering algorithm based on the VeriLook face recognition software by Neurotechnology both in terms of the detection rate and the false detection frequency.

1. Introduction

A potentially useful application of automatic face recognition technology is that of analyzing videos of crowds observing the aftermath of criminal activities such as arson. Informants, accomplices or culprits may observe a collection of related crime scenes; this tendency could indicate their involvement in a series of offenses. We thus call these individuals questionable observers. In contrast, we call individuals that observe relatively few scenes of this nature casual observers. The distinguishing feature that separates the questionable observers from the casual observers is the percentage of videos in which they appear. We propose an algorithm for the questionable observer detection problem, which is to differentiate questionable observers from casual observers.

Whereas much of the work on face recognition from video assumes that an identification algorithm has prior knowledge about the persons to recognize and typically focuses on the recognition of one subject at a time, the problem posed here requires the use of unsupervised classification techniques to aggregate images of the same face observed in different crowd videos.

Figure 1. A solution to the questionable observer detection problem is a collection of face track clusters, each of which represents a distinct individual and contains face tracks from multiple videos.

Identifying information is lost when crowd members occlude one another, which makes both tracking and classification more difficult. Variations in pose, illumination, and facial expression throughout a single video and between different videos can affect face appearance and, hence, complicate questionable observer detection as well. Finally, the video evidence may be recorded by camera phones or surveillance cameras, so the quality of the face image sequences can be very low.

We propose an unsupervised classification algorithm for detecting questionable observers that begins by merging face image sequences, i.e. face tracks, that correspond to the same individual and come from a particular video. We then remove outlying face images from the merged face tracks based on the observation that certain head poses and facial expressions are more likely than others [2, 3]. This operation reduces the influence of unrepresentative data and increases the homogeneity of the sampling encompassed by an individual face track.


The detection algorithm subsequently clusters the face tracks, ideally placing all of the face tracks that represent a particular individual in the same cluster and creating a distinct cluster for every individual. If a cluster contains face tracks from more than a specified number of videos, then the detection algorithm outputs all of its images for review, as shown in Figure 1.

This paper is organized as follows. In Section 2, we review related work on unsupervised face recognition from video and video indexing. Section 3 presents the problem formulation and our proposed detection method. Experimental results obtained on a challenging crowd video dataset are provided in Section 4. Finally, a discussion of our conclusions is given in Section 5.

2. Related Work

During the last decade, an increasing amount of attention has been placed on using unsupervised learning methods to aid face recognition from video. Raytchev and Murase propose graph theoretic clustering methods [12] for recognizing faces and learning previously unobserved facial appearances simultaneously. A hierarchical clustering algorithm is used by Mian [8] to form face clusters from SIFT (Scale Invariant Feature Transform) descriptors that represent local facial features. Hadid and Pietikainen [7] employ K-means clustering to group feature patterns obtained by locally linear embedding for the purpose of computing face exemplars. Similarly, hierarchical agglomerative clustering is applied on the face manifold, as characterized by the geodesic distances between face patterns, by Fan and Yeung [6] to achieve the same end. These exemplar-based recognition methods provide efficiency gains, as a nearest neighbor classifier does not need to perform as many comparisons after a large set of face images is reduced to a collection of representatives.

These works focus on face recognition from videos recorded in one or two homogeneous environments with a single individual in view at a time. In contrast, we focus on the appearance frequency of individuals within a set of multi-person videos. We present experimental results on videos recorded in highly varied environments containing crowds of people. Moreover, the questionable observer detection problem necessitates clusterings that contain as few clusters for each person as possible. The formation of redundant clusters that contain images of the same person can reduce the number of videos that each cluster spans and, hence, result in false negatives, that is, false classifications of questionable observers as casual observers. The unsupervised classification techniques described in [6, 7, 8] yield clusters of face data with related appearances, e.g. similar expressions and poses, so that multiple clusters correspond to the same individual. The algorithm described here mitigates this problem by 1) exploiting the multitude of face observations afforded by videos through the use of face tracks, 2) merging disjoint face image sequences from the same video that contain images of the same person, and 3) performing outlier detection to reduce the impact of intermittent effects that alter facial appearance.

The questionable observer detection problem shares much in common with problems encountered in video and photo album indexing. Some video indexing algorithms rely on the repeated appearance of an individual to decompose a video into semantically coherent subsequences. K-means clustering is employed by Chan et al. [5] to group feature patterns that correspond to an individual who appears frequently in a single news video. Similarly, Ozkan and Duygulu [9] describe a greedy algorithm that exploits facial appearance features as well as information contained within a video transcript with the objective of finding a specified person in a news video. A meeting understanding system that clusters face sequences and then counts meeting attendees is proposed by Vallespi et al. in [14]. This system comprises an adaptive subspace face tracker and a temporal subspace clustering method that extends the normalized cuts clustering algorithm. Prince and Elder [10] combine clustering with a Bayesian approach to count the number of different people that appear in a collection of face images. The parameters of a generative probabilistic model describing the face manifold are learned during training. This model enables the computation of the posterior probability over possible clusterings, so that Bayesian model selection can be applied to compare partitionings of varying sizes.

These methods use appearance frequency in a single video or over a collection of images as a way to understand the composition of a dataset, while we focus on detecting individuals that appear in multiple videos with varied backgrounds and crowds of people in every frame. The questionable observer videos do not afford a text transcript containing additional semantic information that can aid clustering, whereas news videos often do. Likewise, in contrast to the meeting videos discussed in [14], the face tracks of a particular subject from our dataset are not guaranteed to contain similar illumination conditions, as we recorded both indoor and outdoor videos. Finally, the database described in [9] notwithstanding, our dataset of nearly 100 individuals contains significantly more people than the test sets mentioned in the video and photo album indexing works [5, 10, 14].

3. Proposed Method

Given a collection of $m$ videos, $\{V_1, V_2, \ldots, V_m\}$, each of which shows a crowd of people observing a distinct event, we intend to identify those individuals that appear in multiple sequences. Each $V_i$ has a corresponding set of $m_i$ face tracks, $T_i = \{t_{i,1}, t_{i,2}, \ldots, t_{i,m_i}\}$, where every face track consists of a unique sequence of face images that represent the same person. The individual elements in a face track need not come from a contiguous sequence of video frames, but no two elements can come from the same video frame.


Figure 2. The proposed questionable observer detection scheme begins with face tracking and proceeds through image normalization, track merging, outlier detection, cross-video face track clustering and video counting phases.

Multiple face tracks from a single video can contain images of the same individual if she leaves the view of the camera and then reenters it later.

We frame the questionable observer detection problem as one of unsupervised classification in which we assign a class label, $l(t_{i,j})$, to each face track. The set of face tracks of the same person should be assigned the same label. A collection of face tracks that share a common label forms a face track cluster $C_L$:

$$C_L = \{t_{i,j} \in \bigcup_{k=1}^{m} T_k : l(t_{i,j}) = L\}. \quad (1)$$

The questionable observer detection problem requires that we mark any cluster $C_L$ as questionable if the number of videos from which its constituent tracks were extracted surpasses a video count threshold $v$. That is, an individual whose face tracks make up the majority of a cluster for which

$$|\{i \in \{1, 2, \ldots, m\} : \exists j \text{ such that } t_{i,j} \in C_L\}| > v \quad (2)$$

is considered to be a questionable observer.
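The rule in Eq. (2) amounts to counting the distinct source videos represented in each cluster and comparing that count against the threshold $v$. A minimal sketch in Python follows; the data layout and names are illustrative assumptions, not part of the paper.

```python
# Hypothetical sketch of the cluster-flagging rule in Eq. (2): a cluster is
# marked questionable when its tracks originate from more than v videos.
# The (video_id, track_id) representation of a track is an assumption.

def questionable_clusters(clusters, v):
    """clusters: dict mapping cluster label L -> list of (video_id, track_id) pairs."""
    flagged = []
    for label, tracks in clusters.items():
        video_count = len({video_id for video_id, _ in tracks})
        if video_count > v:          # |{i : exists j with t_ij in C_L}| > v
            flagged.append(label)
    return flagged

# Example: tracks from videos 1, 3 and 4 fall into one cluster; with v = 1
# that cluster is reported for review.
example = {"A": [(1, 2), (3, 1), (4, 5)], "B": [(2, 7)]}
print(questionable_clusters(example, v=1))   # -> ['A']
```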

3.1. Algorithm Overview

For each video in a collection, we extract its face tracks, normalize the constituent images, and then merge the face tracks that correspond to the same individual.

An outlier detection algorithm determines a representative collection of face images within each face track to provide robustness to intermittent occlusions or changes in camera focus, facial expression, head pose, etc. Hierarchical agglomerative clustering is subsequently performed on the face tracks from all of the videos to group together the data that represents the same individual. We count the number of videos from which the face tracks in each cluster originate and output the clusters that span a number of videos beyond a specified threshold.

3.2. Face Detection and Tracking

Face tracks are formed by detecting faces in each video frame and connecting them with the faces from prior video frames that share a set of common features. We employ the face detector provided with the VeriLook 4.0 Standard SDK from Neurotechnology [1], which detects faces observed from near frontal viewpoints. Features are tracked across frames with the Kanade-Lucas-Tomasi (KLT) optical flow tracker [13]. If no features lie on a detected face, a new set of features is found within the face region and associated with a new track that only contains this face initially. Conversely, if a detected face contains features that were tracked from the prior video frame, its image is inserted at the end of the face track with which it shares the greatest number of features. A face track is marked complete when the optical flow tracker cannot find any of its corresponding features in the current video frame.
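The association rule described above can be sketched as follows. The bounding-box and track data structures are assumptions made for illustration; the actual system relies on the VeriLook detector and a KLT tracker rather than this simplified bookkeeping.

```python
# A simplified sketch of track association: a newly detected face is appended
# to the existing track with which it shares the most tracked feature points,
# or it starts a new track if no features fall inside its bounding box.

def associate_detection(face_box, tracks, frame_features):
    """face_box: (x, y, w, h); tracks: list of dicts with a 'feature_ids' set;
    frame_features: dict feature_id -> (x, y) position tracked into this frame."""
    x, y, w, h = face_box
    inside = {fid for fid, (fx, fy) in frame_features.items()
              if x <= fx <= x + w and y <= fy <= y + h}
    if not inside:
        return None                      # caller starts a new track with fresh features
    # choose the track sharing the greatest number of features with this face
    best = max(tracks, key=lambda t: len(t["feature_ids"] & inside), default=None)
    if best is None or not (best["feature_ids"] & inside):
        return None
    return best

tracks = [{"feature_ids": {1, 2, 3}}, {"feature_ids": {7, 8}}]
features = {2: (105.0, 112.0), 3: (130.0, 140.0), 8: (400.0, 60.0)}
match = associate_detection((100, 100, 60, 60), tracks, features)
print(match is tracks[0])   # True: the detection shares features 2 and 3 with the first track
```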

We also perform automatic eye localization, as the eye positions are used to align face images to a canonical position after tracking. The VeriLook 4.0 Standard SDK includes an eye detector that is designed to locate eyes in still images. We found that the relative locations of the same pair of detected eyes can vary between consecutive video frames. These variations can lead to poor face image alignments, so that comparisons between two misaligned face tracks that represent the same person suggest that they represent different people.

We handle this issue by using singular value decomposition to form a two dimensional linear subspace that spans the positions of the features associated with a newly created face track. We compute the positions of the eye detections within the subspace and normalize their coordinates to attain partial invariance to scaling. The eye locations for a face that belongs to a preexisting face track are estimated with two operations. First, the basis vectors of the linear subspace spanning the associated feature points are computed. Second, the normalized eye coordinates from the first face track image are transformed into video frame coordinates.


Figure 3. A collection of images from a single face track split into inlier and outlier subsets. The inliers tend to capture appearance features that remain stable throughout the face track, whereas the outliers either have poor alignment, exhibit significant in-plane rotation or contain a face making a transient facial expression.

The normalized eye positions for a face track are recomputed when the KLT tracker cannot locate one of its corresponding features or the estimated eye locations lie more than a percentage of the face width and height away from the eye position estimates from the VeriLook detector. The second condition is enforced to prevent drift, while the first maintains the accuracy of the normalized eye positions when the set of feature points changes.
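One possible reading of this subspace-based eye stabilization is sketched below with NumPy. The scale normalization by the singular values is an assumption, since the exact normalization is not specified in the text.

```python
import numpy as np

# Feature points from one frame define a 2-D basis (via SVD of the centered
# point set); eye positions are expressed and scale-normalized in that basis,
# then mapped back to pixel coordinates in later frames using the basis
# computed from the tracked positions of the same features.

def subspace_basis(points):
    """points: (N, 2) array of feature positions in one frame."""
    center = points.mean(axis=0)
    _, s, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt, s            # 2-D basis vectors (rows of vt) and scales

def eyes_to_normalized(eyes, center, vt, s):
    return ((eyes - center) @ vt.T) / s       # scale-normalized subspace coordinates

def eyes_to_frame(norm_eyes, center, vt, s):
    return (norm_eyes * s) @ vt + center      # back to video-frame coordinates

pts0 = np.array([[10., 20.], [40., 25.], [30., 60.], [15., 55.]])   # frame 1 features
eyes0 = np.array([[18., 30.], [35., 32.]])                          # detected eyes, frame 1
c0, v0, s0 = subspace_basis(pts0)
norm_eyes = eyes_to_normalized(eyes0, c0, v0, s0)

pts1 = 1.2 * pts0 + np.array([5., -3.])       # the same features tracked into a later frame
c1, v1, s1 = subspace_basis(pts1)
print(eyes_to_frame(norm_eyes, c1, v1, s1))   # estimated eye positions in the new frame
```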

3.3. Face Image Normalization

In general, the faces observed by the tracker rotate within the image plane and vary in scale, the extracted face images contain elements of the background, and the dynamic range of intensity values varies between video frames. We use the csuPreprocessNormalize tool from the Colorado State University Face Identification Evaluation System Version 5.0 [4] to perform geometric normalization, image masking and histogram equalization on all face images to abate these respective issues. Finally, we convert each image into a one dimensional pattern vector by unraveling the non-masked image content.
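The final vectorization step can be illustrated as follows; the elliptical mask is an assumption, as the paper delegates the full preprocessing chain to the CSU csuPreprocessNormalize tool.

```python
import numpy as np

# A minimal sketch of the vectorization step: after geometric normalization and
# histogram equalization, only the non-masked pixels of each aligned face image
# are unraveled into a 1-D pattern vector.

def to_pattern_vector(face_image, mask):
    """face_image: 2-D float array; mask: boolean array of the same shape."""
    return face_image[mask].astype(np.float64).ravel()

h, w = 64, 64
ys, xs = np.ogrid[:h, :w]
mask = ((ys - h / 2) / (h / 2)) ** 2 + ((xs - w / 2) / (w / 2)) ** 2 <= 1.0
pattern = to_pattern_vector(np.random.rand(h, w), mask)
print(pattern.shape)   # length equals the number of non-masked pixels
```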

3.4. Track Merging and Outlier Detection

After extracting face tracks and normalizing face images, we iterate over the set of videos and apply hierarchical agglomerative clustering (Sec. 3.5) to merge the face tracks from a specific video that contain images of the same person. We can make clustering decisions in the context of a single video with a high degree of confidence as some of the factors that affect facial appearance, such as illumination and hair growth, typically do not vary as drastically within a video clip as across different video clips.

The merged face tracks can nevertheless span a greater range of appearance variations than the input tracks. Some observations may capture transient facial features caused by changes in facial expression, momentary head rotation, image noise and other influences, as shown in Figure 3. Such observations yield unstable feature vectors and increase the within-class appearance variability insofar as they cause a particular face to appear different when viewed at different times.

We employ an outlier detection algorithm to alleviate the impact of such problems on later clustering decisions.

For a face track $t$ with $m_t$ face patterns, the algorithm determines the average distance from each face pattern in $t$ to its nearest neighbors. If the average distance for a pattern lies beyond a specified outlier boundary, that pattern is discarded. Specifically, outlying face patterns are detected with the following steps:

1. For each pattern $x_i$ in face track $t$, compute the average pattern distance, $d_a^{(i)}$, to its $k$ nearest neighbors with respect to the L2 norm.

2. Compute the mean, $\mu_a$, and the standard deviation, $\sigma_a$, of the average pattern distances.

3. Discard any pattern $x_i$ for which $d_a^{(i)} > \mu_a + s \cdot \sigma_a$, where $s$ is a scaling parameter that defines the boundary between outliers and inliers.

We chose $k = 0.25 \cdot m_t$, but found that the performance of the outlier detection algorithm is not particularly sensitive to the choice of $k$. The scaling parameter, $s$, was set to -0.7 based on experimental results. We also constrained the third step to always leave at least one pattern in a track to avoid the degenerate case where a large number of face tracks are left empty.
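A sketch of these three steps, assuming each face track is stored as a NumPy array of pattern vectors; the handling of very short tracks is an added guard, not something prescribed by the text.

```python
import numpy as np

# Outlier removal within one face track. k = 0.25 * m_t and s = -0.7 follow the
# parameter choices reported above; at least one pattern is always retained.

def remove_outliers(patterns, k_frac=0.25, s=-0.7):
    """patterns: (m_t, d) array of pattern vectors for one track."""
    m = len(patterns)
    if m < 3:
        return patterns                  # assumption: leave very short tracks untouched
    k = max(1, int(k_frac * m))
    # pairwise L2 distances between the patterns in the track
    dist = np.linalg.norm(patterns[:, None, :] - patterns[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # step 1: average distance to the k nearest neighbours
    d_a = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    # step 2: mean and standard deviation of the average distances
    mu_a, sigma_a = d_a.mean(), d_a.std()
    # step 3: keep only patterns inside the outlier boundary
    keep = d_a <= mu_a + s * sigma_a
    if not keep.any():
        keep[np.argmin(d_a)] = True      # always leave at least one pattern
    return patterns[keep]
```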

3.5. Face Track Clustering

We perform hierarchical agglomerative clustering (HAC) to group together similar face tracks over a variety of scales. The HAC algorithm takes a similarity matrix as input and outputs a tree, called a dendrogram, in which every level contains a cluster formed by merging a pair of clusters from levels located closer to the leaves. The leaves of the tree represent singleton clusters consisting of individual face tracks, whereas the root represents a single cluster that contains all face tracks. The dendrogram levels correspond to similarity values, i.e. the clusters that are merged together to form clusters near the leaves are more similar than the clusters that are merged together to form clusters near the root. A particular set of clusters is selected by cutting the tree at some level.

The distance $d$ between two face tracks, $t'$ and $t''$, is given by the variance of their face patterns. That is, we aggregate the $l$ closest face pattern pairs that span both tracks into a face pattern set $X = \{x_1, x_2, \ldots, x_{2l}\}$ and take

$$d(t', t'') = \frac{1}{2l} \sum_{i=1}^{2l} \|x_i - \mu_X\|^2, \quad (3)$$

where $\mu_X$ is the mean pattern of $X$. We set $l = \min(|t'|, |t''|)$ to guarantee that we weight the overall contributions of the two face tracks equally when computing the mean.


Moreover, we construct $X$ from the nearest pairs of patterns to ensure that appearance variations related to identity, as opposed to those resulting from differences in pose or facial expression, are the primary factors in computing the variance.
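A sketch of the track distance in Eq. (3). Whether the $l$ closest pairs may share patterns is not specified in the text, so this version simply takes the $l$ smallest cross-track distances.

```python
import numpy as np

# Track distance per Eq. (3): pool the l closest cross-track pattern pairs
# (l = min(|t'|, |t''|)) into a set X of 2l patterns and return the variance
# of X about its mean.

def track_distance(t1, t2):
    """t1: (n1, d) array of patterns; t2: (n2, d) array of patterns."""
    l = min(len(t1), len(t2))
    cross = np.linalg.norm(t1[:, None, :] - t2[None, :, :], axis=2)
    flat = np.argsort(cross, axis=None)[:l]          # indices of the l closest pairs
    rows, cols = np.unravel_index(flat, cross.shape)
    X = np.vstack([t1[rows], t2[cols]])              # 2l patterns
    mu = X.mean(axis=0)
    return np.mean(np.sum((X - mu) ** 2, axis=1))    # (1 / 2l) * sum ||x_i - mu_X||^2

rng = np.random.default_rng(0)
print(track_distance(rng.normal(size=(5, 16)), rng.normal(size=(8, 16))))
```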

We apply the HAC algorithm to cluster face tracks based on the pairwise track distances. This clustering method reflects a bottom-up approach, as shown below:

1. Generate a collection of initial clusters, $\{C_1, C_2, \ldots, C_T\}$, where $T$ is the number of face tracks and each cluster contains a single track.

2. While we have more than one cluster,

(a) find the most similar pair of clusters, $C_i$ and $C_j$, according to a metric $S$, and

(b) merge Ci and Cj .

The selection of the cluster similarity metric, $S$, can determine the shape of clusters represented by the dendrogram. We define $S$ in terms of the least similar pair of tracks from two clusters, as indicated by the pairwise track variance, $d$:

$$S = \min_{t' \in C_i,\, t'' \in C_j} -d(t', t''). \quad (4)$$

The use of the min operator results in compact clusters with small diameters as opposed to chain-like clusters with large diameters, which tend to result from the application of the max operator. The variance of two tracks located at the opposite ends of a chain-like cluster can be significantly higher than the variance of a pair of tracks encompassed by a compact cluster. Allowing for such a large variance increases the chance that a cluster will contain face tracks that represent different individuals.
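Because Eq. (4) ranks cluster pairs by the largest pairwise track distance between them, the merge rule corresponds to complete linkage. Assuming the pairwise distances of Eq. (3) have been assembled into a square matrix, the dendrogram can be built with off-the-shelf routines such as SciPy's; this is an implementation choice for illustration, not the authors' code.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Complete-linkage HAC over the pairwise track distances of Eq. (3); cutting
# the resulting dendrogram at a chosen level yields a flat track clustering.

def build_dendrogram(dist_matrix):
    """dist_matrix: symmetric (T, T) array of track distances with zero diagonal."""
    condensed = squareform(dist_matrix, checks=False)
    return linkage(condensed, method='complete')

def cut(dendrogram, level):
    """Return cluster labels obtained by cutting the dendrogram at a distance level."""
    return fcluster(dendrogram, t=level, criterion='distance')
```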

We select the dendrogram level that minimizes the ratio of the total intra-cluster variance to the inter-cluster variance, as each cluster should be homogeneous with respect to identity and different clusters should represent different individuals. Moreover, if we choose a dendrogram level with a large number of clusters, multiple clusters may contain face tracks from the same person. This can decrease questionable observer detection accuracy substantially if none of these clusters contains face tracks from more than one video in which their corresponding individual appears. We thus take into account the number of groups within a clustering when cutting the dendrogram.

Instead of measuring cluster variance in terms of face tracks, we decompose each track into its face patterns and compute the inter- and intra-cluster variances using these vectors. For a cluster $C_i = \{x_1, x_2, \ldots, x_{c_i}\}$ with $c_i$ face patterns and cluster center $\mu_i$, we express the intra-cluster variance as

$$Var(C_i) = \frac{1}{c_i} \sum_{j=1}^{c_i} \|x_j - \mu_i\|^2 \quad (5)$$

Figure 4. An example of a dendrogram constructed from the face tracks of two videos. A particular partitioning of the data is obtained by cutting the dendrogram at the level that minimizes the cost function, $\varepsilon$. If the level indicated by the red dashed line yields the lowest value of $\varepsilon$, we would cut the dendrogram at that level to attain the following clustering: $C_1 = \{t_{1,2}\}$, $C_2 = \{t_{1,1}, t_{2,2}\}$, $C_3 = \{t_{2,3}\}$, $C_4 = \{t_{1,3}, t_{2,1}\}$, where $C_i$ denotes a cluster. Furthermore, clusters $C_2$ and $C_4$ would be labelled as questionable observer clusters, assuming that the video count threshold was set to one video.

under the assumption that all face images are equally likely to be observed. We compute the overall intra-cluster variance, $Var_i$, by taking the sum of all the individual intra-cluster variances. The inter-cluster variance, $Var_e$, is expressed in terms of the cluster centers:

$$Var_e = \frac{1}{c} \sum_{i=1}^{c} \|\mu_i - \mu\|^2 \quad (6)$$

where $c$ denotes the number of clusters and $\mu$ represents the mean of the cluster centers. The clustering level that we choose is the one that minimizes the following cost function:

$$\varepsilon = \alpha \cdot \frac{Var_i}{Var_e} + (1 - \alpha) \cdot c. \quad (7)$$

The $\alpha$ weighting parameter determines the tradeoff made between clustering accuracy and clustering redundancy. A high value of $\alpha$ will result in an accurate clustering with a large number of redundant clusters, whereas a low value will result in many clusters that contain data from multiple people but fewer redundant clusters. For the first clustering, by which the face tracks that were extracted from the same video and represent a common individual are merged, $\alpha$ is set to 1.0 to avoid propagating errors to the second clustering step. For the second clustering, by which the face tracks from multiple videos are grouped together, we determined experimentally that $\alpha = 0.97$ provides the best results.
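The cut criterion of Eqs. (5)-(7) can be sketched as follows, assuming each candidate clustering is given as a list of per-cluster pattern arrays; the guard for single-cluster cuts is an added assumption.

```python
import numpy as np

# Score one candidate clustering with the cost of Eq. (7): the ratio of the
# summed intra-cluster pattern variance (Eq. 5) to the variance of the cluster
# centers (Eq. 6), plus a penalty on the number of clusters weighted by
# (1 - alpha). The dendrogram level with the lowest cost is kept.

def clustering_cost(clusters, alpha=0.97):
    """clusters: list of (c_i, d) arrays of face patterns, one array per cluster."""
    if len(clusters) < 2:
        return float("inf")              # a single-cluster cut has no inter-cluster variance
    centers = np.array([cl.mean(axis=0) for cl in clusters])
    # total intra-cluster variance, Eq. (5) summed over clusters
    var_intra = sum(np.mean(np.sum((cl - cl.mean(axis=0)) ** 2, axis=1)) for cl in clusters)
    # inter-cluster variance of the centers, Eq. (6)
    mu = centers.mean(axis=0)
    var_inter = np.mean(np.sum((centers - mu) ** 2, axis=1))
    return alpha * var_intra / var_inter + (1 - alpha) * len(clusters)

rng = np.random.default_rng(1)
cut_a = [rng.normal(0, 1, (10, 8)), rng.normal(5, 1, (12, 8))]
cut_b = [np.vstack(cut_a)]
print(clustering_cost(cut_a), clustering_cost(cut_b))   # the two-cluster cut scores lower
```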

4. Experimental Results

To evaluate the ability of the proposed technique to detect questionable observers, we acquired a dataset comprising fourteen 25-59 second crowd video clips.


Figure 5. Complicating factors in the experimental dataset. Top row: images of two questionable observers taken under varying illumination conditions. Middle row: images of another two questionable observers making distinct facial expressions in different videos. Bottom row: instances where subjects were occluded by other crowd members or their own body parts.

The clips were recorded in different locations around the University of Notre Dame campus over a period of seven months. This collection of videos, called the ND-QO-Flip dataset, contains 90 subjects overall, five of whom appear in multiple videos and 85 of whom appear in one video. The appearances of the questionable observers differ between videos as they wear different clothes and their facial hair lengths change, due to the seven-month time lapse. Details about how to obtain this dataset are available at http://cse.nd.edu/~cvrl/CVRL/Data_Sets.html.

In each clip, the camera pans and zooms over a crowd of four to 12 people. Most people are seen from a nearly frontal viewpoint because the observers tend to face toward the camera or focus on an object behind it. The crowd members were allowed to exhibit any facial expression they chose. The video set contains 12 outdoor videos, including six that were acquired under overcast conditions, six that were recorded when the sun was visible, three with snow cover and one with falling snow. The videos thus contain extensive variations in illumination and facial expression along with partial face occlusions caused by the way the crowds formed.

A Cisco Flip handheld camcorder was used to acquire the videos. All of the videos were compressed with the H.264 codec, have a 640x480 resolution and a frame rate of 30 frames per second. As shown in Figure 6, the Flip camcorder performs automatic white balancing and exposure control, which can cause the color range of a video to vary between successive frames. Moreover, the zoom feature is implemented in software, so some video frames can become highly blurred.

Figure 6. Face images from the Flip videos. Top row: blurry images recorded under full zoom. Bottom row: instances where the automatic exposure and white balancing adjustments changed facial appearance within the same face track.

Finally, the videos tend to have short subsequences in which the colors of adjacent image regions bleed together.

4.1. Performance Metrics

In the ideal clustering, all face tracks belonging to the same individual would be assigned to the same cluster and all individuals would have a distinct cluster. The classification accuracy would be perfect in this case, as would the questionable observer detection rate. The clusterings produced in practice differ from this ideal because of two types of errors: 1) some of the face tracks belonging to one subject may be assigned to a cluster that better represents a different subject, or 2) multiple clusters may represent the same individual.

The first type of error impacts clustering accuracy by reducing the homogeneity of clusters with respect to identity. The self-organization rate (SOR), which was used by Raytchev and Murase [11], accounts for this error type. It is given by

$$SOR = 1 - \frac{\sum n_{ab} + n_e}{n}, \quad (8)$$

where $n_{ab}$ denotes the number of patterns representing individual $a$ that were assigned to a cluster dominated by the patterns of individual $b$, $n_e$ represents the number of patterns that are assigned to a cluster in which no single individual corresponds to more than half of the patterns, and $n$ denotes the number of patterns in the clustering.

When the images of an individual represent the majority in numerous clusters, all but one of those clusters are redundant. For a particular person $a$ whose data points represent the majority in $c_a$ clusters, $c_a - 1$ of those clusters are redundant. Hence, the number of redundant clusters is given by

$$c_r = \sum_a (c_a - 1). \quad (9)$$

In turn, we define the cluster conciseness, $CON$, as

$$CON = 1 - \frac{c_r}{c}, \quad (10)$$


where $c$ is the number of clusters. $CON$ decreases as the number of redundant clusters grows, so higher $CON$ values indicate a higher quality clustering, as is the case for the SOR.
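A sketch of the SOR and CON computations, assuming ground-truth identity labels are available for every pattern and interpreting "dominated" as a strict majority:

```python
from collections import Counter

# Clustering quality metrics of Eqs. (8)-(10). Each cluster is given as a list
# of ground-truth identity labels, one per face pattern assigned to it; this
# input format is an assumption made for illustration.

def sor_and_con(clusters):
    n = sum(len(c) for c in clusters)
    misassigned = 0                     # sum of n_ab plus n_e in Eq. (8)
    majority_counts = Counter()         # c_a: clusters in which identity a is the majority
    for labels in clusters:
        counts = Counter(labels)
        top_id, top = counts.most_common(1)[0]
        if top > len(labels) / 2:
            misassigned += len(labels) - top    # patterns of other identities (n_ab)
            majority_counts[top_id] += 1
        else:
            misassigned += len(labels)          # no majority identity (n_e)
    sor = 1 - misassigned / n
    redundant = sum(c_a - 1 for c_a in majority_counts.values())   # Eq. (9)
    con = 1 - redundant / len(clusters)                            # Eq. (10)
    return sor, con

print(sor_and_con([["a", "a", "b"], ["a", "a"], ["b", "c"]]))
```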

The false positive rate (FPR) and false negative rate (FNR) are used to convey how well the detection algorithm distinguished questionable observers from casual observers. A false positive occurs when any cluster that represents a casual observer has face tracks from more than $v$ videos. A false negative occurs when none of the clusters that correspond to a questionable observer contain face tracks from more than $v$ videos. Both the false positive and false negative rates vary from zero to one. A lower value indicates better detection performance in each case.

4.2. Results

We implemented four additional detection methods for the purposes of comparison. Three of these provide the means to determine the impacts of track merging and outlier detection. The first performs HAC without track merging or outlier detection, the second combines HAC with track merging, while the third employs HAC and outlier detection. We refer to these techniques as HAC, HAC/track merging and HAC/outlier detection, respectively.

The fourth technique, called HAC/VeriLook, incorporates components from the state-of-the-art VeriLook face recognition software by Neurotechnology [1]. The VeriLook 4.0 Standard SDK provides a face template generation algorithm that can create a generalized template characterizing a wide range of appearance variations for a particular person. We performed k-means clustering on every face track after the track merging step to form clusters representing different appearance modes. We subsequently selected the highest quality face image from each cluster, as indicated by the VeriLook face detector, and inserted these images into the generalized template for the associated face track. The generalized templates from all of the videos were compared using the VeriLook matcher to create a similarity matrix. The hierarchical agglomerative clustering algorithm discussed in Section 3.5 was then used to group the face tracks based on the VeriLook similarity scores. The clustering is selected from the dendrogram in terms of the questionable observer detection FPR and FNR, not Equation 7, so that the HAC/VeriLook algorithm has the advantage of using the best grouping possible with respect to the detection performance metrics.

As the results in Table 1 demonstrate, the track merging and outlier detection steps improved questionable observer detection performance, which allowed the proposed algorithm to outperform the HAC/VeriLook algorithm across all metrics.

Table 1. Questionable observer detection results for a video count threshold of one video, with the best result for each performance metric highlighted in green.

Method                   SOR     CON     FPR     FNR
HAC                      0.966   0.603   0.018   0.60
HAC/track merging        0.972   0.629   0.018   0.60
HAC/outlier detection    0.912   0.642   0.064   0.00
HAC/VeriLook             0.895   0.317   0.061   0.40
Proposed algorithm       0.960   0.664   0.056   0.00

Track merging increased clustering accuracy and decreased the percentage of redundant clusters, as shown by the superior performance of the HAC/track merging method relative to the HAC technique and that of the proposed algorithm relative to the HAC/outlier detection method.

Both the proposed algorithm and the HAC/outlier detection technique surpass the HAC method in terms of the CON and FNR, yet produce less accurate clusterings with more false detections. One reason for this discrepancy is that the outlier detection step removes images from all face tracks. The misclassified face tracks that contain a relatively large number of images incur a greater penalty after a significant number of face images are discarded overall. The HAC/outlier detection method and the proposed algorithm achieved a lower FNR because they yielded fewer redundant clusters, but they obtained a higher FPR since they formed more clusters that span multiple identities. These performance trends exemplify the tradeoff between clustering accuracy and redundancy as well as the tradeoff between the FPR and the FNR.

In the context of the questionable observer detection problem, the penalty for a false negative is generally higher than that for a false positive. The detection algorithm outputs a relatively small number of questionable observer clusters for review. In the case of a false positive, a user can analyze such a cluster to verify that it actually contains a set of face tracks that were extracted from a specified number of videos and represent the same person. In the case of a false negative, all of the clusters that represent a questionable observer are assigned to a large collection of casual observer clusters. The time cost associated with reviewing this collection is much higher than that associated with validating the questionable observer clusters. These relative costs indicate that the achievement of a low FNR is of the utmost importance, followed by the attainment of a low FPR and a high SOR. Hence, the proposed algorithm achieved the best performance balance for the questionable observer detection problem.

5. Conclusions

The face classification problem presented in this paper is difficult for a number of reasons.


Unlike the face exemplar selection task to which face clustering methods are often applied [6, 7, 8], the questionable observer detection problem does not allow for the aggregation of a large number of redundant clusters, as this could result in an excessive number of false negatives. Here, we focus on analyzing the appearance frequency of individuals across collections of crowd videos acquired under a wide variety of illumination conditions, which, to the best of our knowledge, is an unresearched application of face clustering.

The instance of the questionable observer detection problem considered here is challenging due to the large number of nuisance variables within the experimental dataset, including facial expression, illumination, occlusion, and video quality, amongst others. We found that a combination of track merging, outlier detection and hierarchical agglomerative clustering can successfully detect the questionable observers while yielding a low frequency of false detections despite these difficulties. Moreover, these techniques outperformed a system constructed from the VeriLook 4.0 Standard SDK [1] in terms of the clustering conciseness as well as the self-organization, false positive and false negative rates.

6. Acknowledgements

This research is supported by the Central Intelligence Agency, the Biometrics Task Force and the Technical Support Working Group through US Army contract W91CRB-08-C-0093. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of our sponsors.

References

[1] VeriLook Standard SDK and Extended SDK. http://www.neurotechnology.com/vl_sdk.html, 2010.
[2] O. Arandjelovic and R. Cipolla. An information-theoretic approach to face recognition from face motion manifolds. Image and Vision Computing, 24(6):639-647, 2006.
[3] J. Bavelas and N. Chovil. Faces in dialogue. The Psychology of Facial Expression, pages 334-346, 1997.
[4] D. S. Bolme, J. R. Beveridge, M. Teixeira, and B. A. Draper. The CSU face identification evaluation system: Its purpose, features and structure. Proc. International Conference on Vision Systems, pages 304-311, 2003.
[5] Y. Chan, S. Lin, Y. Tan, and S. Kung. Video shot classification using human faces. Proc. IEEE International Conference on Image Processing, 3:843-846, 1996.
[6] W. Fan and D. Yeung. Face recognition with image sets using hierarchically extracted exemplars from appearance manifolds. Proc. 7th International Conference on Automatic Face and Gesture Recognition, pages 177-182, 2006.
[7] A. Hadid and M. Pietikainen. Selecting models from videos for appearance-based face recognition. Proc. 17th International Conference on Pattern Recognition, 1:304-308, 2004.
[8] A. Mian. Unsupervised learning from local features for video-based face recognition. Proc. 8th IEEE International Conference on Automatic Face and Gesture Recognition, 2008.
[9] D. Ozkan and P. Duygulu. Finding people frequently appearing in news. Proc. 5th International Conference on Image and Video Retrieval, pages 173-182, 2006.
[10] S. Prince and J. Elder. Bayesian identity clustering. Proc. 2010 Canadian Conference on Computer and Robot Vision, pages 32-39, 2010.
[11] B. Raytchev and H. Murase. Unsupervised face recognition from image sequences based on clustering with attraction and repulsion. Proc. 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:II-25-30, 2001.
[12] B. Raytchev and H. Murase. Unsupervised face recognition by associative chaining. Pattern Recognition, 36:245-257, 2003.
[13] J. Shi and C. Tomasi. Good features to track. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.
[14] C. Vallespi, F. De la Torre, M. Veloso, and T. Kanade. Automatic clustering of faces in meetings. Proc. IEEE International Conference on Image Processing, pages 1841-1844, Oct. 2006.