
Image and Vision Computing 32 (2014) 476–486


Automatic expression spotting in videos☆

Matthew Shreve ⁎, Jesse Brizzi, Sergiy Fefilatyev, Timur Luguev, Dmitry Goldgof, Sudeep Sarkar
Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA

☆ This paper has been recommended for acceptance by Georgios Tzimiropoulos.
⁎ Corresponding author at: 4202 E. Fowler Avenue, University of South Florida, ENB-118, Tampa, FL 33620-5399, USA. Tel.: +1 813 974 3652; fax: +1 813 974 5456. E-mail address: [email protected] (M. Shreve).

http://dx.doi.org/10.1016/j.imavis.2014.04.010
0262-8856/© 2014 Published by Elsevier B.V.

Article info

Article history:
Received 12 February 2013
Received in revised form 18 February 2014
Accepted 30 April 2014
Available online 9 May 2014

Keywords:
Expression spotting
Macro-expressions
Micro-expressions

Abstract

In this paper, we propose a novel solution for the problem of segmenting macro- and micro-expression frames (or retrieving the expression intervals) in video sequences, which is a prior step for many expression recognition algorithms. The proposed method exploits the non-rigid facial motion that occurs during facial expressions by capturing the optical strain corresponding to the elastic deformation of facial skin tissue. The method is capable of spotting both macro-expressions, which are typically associated with expressed emotions, and rapid micro-expressions, which are typically associated with semi-suppressed macro-expressions. We test our algorithm on several datasets, including a newly released hour-long video with two subjects recorded in a natural setting that includes spontaneous facial expressions. We also report results on a dataset that contains 75 feigned macro-expressions and 37 feigned micro-expressions. We achieve over a 75% true positive rate with a 1% false positive rate for macro-expressions, and a nearly 80% true positive rate for spotting micro-expressions with a .3% false positive rate.

© 2014 Published by Elsevier B.V.

1. Introduction

Accurately and automatically spotting frames containing facial expressions in videos is often a prior step required for high-level facial analysis, including identifying emotional response, gestures, and human identification. In many papers, this is a manual pre-processing step, or it is assumed that the data consists of a single facial expression sequence. We address this problem using an optical strain based method that is capable of automatically spotting expressions in video sequences.

In this paper we do not address the problem of identifying expressions, but only solve the expression segmentation problem (see Fig. 1). Since our method is based on the non-rigid motion of the face, and not on pre-defined expression models, we are able to capture a large variety of facial motion corresponding to facial expressions. In other words, while some traditional techniques are capable of recognizing pre-defined expressions (such as recognizing when a person smiles or shows surprise in a video sequence), it is not possible for these types of methods to recognize expressions for which the algorithm has not been trained. We propose a novel expression spotting method that can be used to locate and distinguish between two broad classes of expressions. First are macro-expressions, which are generally characterized as occurring over several seconds and over several regions of the face. The second class of expressions, micro-expressions, typically occurs rapidly and in a single region of the face.


The method presented in this paper represents our complete work on expression spotting. Earlier ideas related to this method were documented in [1], with some further results reported in [2]. Some of our early micro-expression work was reported in [3] and also in a medical application [4]. We have developed a completely new algorithm compared with our prior work. Specifically, we have included more robust face tracking, a new masking technique, and a new peak detection algorithm. We show the performance of the algorithm at several scaled resolutions. We give results on more challenging datasets, including longer videos that contain a mixture of both macro- and micro-expressions during a single sequence, as well as spontaneous (genuine) facial expressions.

The method consists of the following steps: (i) the face and eyes are detected in all frames of the video sequence. These coordinates are then used to segment the face into several regions; (ii) the non-rigid motion is estimated using an optical flow based method over several frames, for each region; (iii) a masking technique is used that removes erroneous flow estimation caused by blinking the eyes or the opening and closing of the mouth; (iv) optical strain maps are calculated over each pair of frames, which are then summed to generate strain magnitude for four different regions on the face; (v) lastly, a peak detection method is used to locate intervals that correspond with macro-expressions [5], and the remaining intervals are then analyzed for micro-expressions.

It is important to note that while many papers address the problem of expression spotting, the definition of spotting is not always consistent. It may also be referred to as expression detection, or in the case of genuine expressions, spontaneous [6] or authentic expression analysis [7]. In some papers, this refers to determining if a pre-segmented group of frames does or does not contain an expression.


Fig. 1. The scope of this paper is seen in red (dashed circle). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


For example, Zeng et al. [6] propose a single classification method for detecting spontaneous expressions (emotion or non-emotion) by training a Support Vector Data Description (SVDD) on several examples of emotion data. Then, test segments containing roughly an equal number of frames are classified with a single binary decision (hence chance is 50%). The same type of experimental setup can be found in [8] for macro-expressions, and in Pfister et al. [9] and Polikovsky et al. [10] for micro-expressions. In [9], local spatio-temporal features are extracted from a high-speed video sequence (100 frames/s) and then performance is measured using several classifiers. Similarly, in [10] a high-speed camera (300 frames/s) is used to capture rapid micro-expressions. In their work, they use 3-D gradient histograms and the Facial Action Coding System (FACS) to spot the 4 states of micro-expressions: onset, apex, apex offset, and apex neutral.

A single frame approach using Gabor filters and GentleSVM is used by Wu et al. [11]. Their work is based on the assumption that the appearance of micro-expressions completely resembles macro-expressions, and thus they reduce the entire micro-expression problem to the temporal duration of macro-expressions. While this definition may fit a subset of micro-expressions, we instead use the general categorization found in [12] that defines them as a suppression of macro-expressions. Hence, they are often distorted or only fractional representations of macro-expressions [13].

Some papers that address the problem of automatic expression analysis do so only for pre-trained examples, or in other words, they are capable of recognizing only a subset of expressions in uncut videos. For instance, a dynamic approach can be found in the work by Sung et al. [14], where the authors use generalized discriminant analysis for recognizing a subject randomly expressing four different facial expressions (neutral, happy, surprised, angry) in roughly 30-second videos. Another example can be found in [15], where seven expressions (neutral, sadness, anger, joy, fear, disgust, surprise) are automatically recognized in videos. A method that can detect several more affective states can be found in [16].

Static (single frame) approaches that model a subset of macro-expressions are also found extensively in the literature, with perhaps local binary patterns performing among the best [17]. In the work by Ruiz-Hernandez and Pietikäinen [18], a novel LBP encoding technique is proposed that uses a re-parametrization of the second order Gaussian jet. Similar to [9], they do not perform spotting, but test on sequences that each contain a single expression.

A popular method for describing several types of expressions on the face is the Facial Action Coding System (FACS). In this system, an action unit label is given to different types of facial motion, and the activation of one or more action units can be used to describe a facial expression. There are several works in the literature that automatically detect action units, as well as provide a measure of intensity [19,20]. While there is a similarity between detecting the activation of an action unit and detecting expressions, the two are not equivalent. For instance, in the DISFA dataset [21], every action unit is a measure of several parts of expressions, but not fully reduced; in other words, there are still types of expressions, especially micro-expressions, that are not represented by the labeled action units. Of course, it is possible to find an action unit to describe every type of motion on the face, but then training would be required on each of them. Hence, the main contribution of our work is that individual training for each type of motion, or action unit, is not needed. The goal of our approach is to successfully detect any type of motion that causes strain, or deformation, on the face.

In the work by Zhou et al. [22], the authors propose to use Aligned Cluster Analysis on points obtained using FACS. In their work, they are able to identify differing types of spontaneous facial expressions, although it is dependent on a manually defined number of expression clusters, and performance on micro-expression detection is not given. Another unsupervised approach is given by Liwicki et al. [23]. In this approach, an online temporal video segmentation technique is described that uses a subspace learning method. While it has been used to successfully segment several macro-expressions from a video sequence, it assumes that changes in a scene (or signal) occur slowly, thus it does not appear to be suited for rapid micro-expressions, which can occur in as little as 2–3 frames.

Similarly to the unsupervised methods, we do not rely on previously trained models of expressions. However, we want to emphasize a few key highlights of our method: (i) we rely on the fundamental dynamics of facial expressions, in that they cause the non-rigid deformation of the facial skin tissue. Hence, our method is naturally suited to spot all expressions that cause facial skin deformation; (ii) we detect facial expressions over the entire video sequence, without any manual temporal pre-segmentation (however, we do provide results for an experiment that is formatted similarly to [9] and [18], who both test on the SMIC corpus, so performance can be compared); and (iii) we are not aware of any other method that has yet been proposed that detects both macro- and micro-expressions. Finally, for clarity, we propose that expression spotting refer to the temporal segmentation of an entire video into segments that only contain the frames of each expression.

To further place our method into perspective, we find the categorization of motion-based methods useful for expression spotting. In general, they have been organized [24] into three types, namely: point-model, holistic, or some mixture of these two. Point-model approaches track several key points on the face over time. The interplay of these points can then be used to recognize the expression [25]. However, while these algorithms may be sufficient for large macro-expressions, the average 2–3 pixel “jittering” in frame-to-frame tracking often suppresses the nearly equally small movement observed in micro-expressions. Alternatively, holistic approaches track all points on the face [26], and hence become more suitable for detecting a larger variety of possible facial expressions, including small motion.

Our method fits into the last category, i.e., it uses both a point-model and holistic approach. Our approach segments the face into several regions based on several detected landmarks. Then, within these segmented regions we use a holistic optical flow method that densely tracks each point on the face. Hence, we hope to have minimized the drawback of approaches in the first category, while taking advantage of the potential robustness associated with methods in the second category.

2. Background

Expressions are generally believed to be the physiological response to an internal emotional state. While there does appear to be a universality for some expressions (such as happiness, sadness, surprise, disgust, and anger), there is a much larger number of possible expressions, as well as large inter- and intra-subject variability for the same expression. For instance, in Fig. 2 there are some expressions we may immediately recognize (such as anger in column d), while others may be harder to recognize, or in fact may not be recognized without context. Hence, the goal of this paper is not to present a method which only spots pre-defined expressions, but to spot segments containing any possible type of facial expression that involves the strain (or deformation) of facial skin tissue.

2.1. Macro-expressions

Macro-expressions typically last 3/4th of a second to 2 s (roughly 24–60 frames) [13]. There are 6 universal expressions: happiness, sadness, fear, surprise, anger, and disgust. Spatially, macro-expressions can occur over multiple or single regions of the face, depending on the expression. For instance, the surprise expression generally causes motion around the eyes, forehead, cheeks, and mouth, whereas the expression for fear typically generates motion only near the eyes.

Fig. 2. Some expressions can be easily identified, while others can be difficult without context. Both subjects were asked to perform random expressions. Eyes masked for privacy concerns.

2.2. Micro-expressions

In general, a micro-expression is described as an involuntary pattern of the human body that is significant enough to be observable, but may not fully convey the triggering emotion [13,3]. Micro-expressions occurring on the face are rapid and are often missed during casual observation. Lasting from 1/25th to 1/3rd of a second (roughly 2–10 frames) [12], micro-expressions can be classified, based on how an expression is modified, into three types [13]:

• Type 1. Simulated expressions: when a micro-expression is not accompanied by a genuine expression.
• Type 2. Neutralized expressions: when a genuine expression is suppressed and the face remains neutral.
• Type 3. Masked expressions: when a genuine expression is completely masked by a falsified expression.

Type 2 micro-expressions are not observable and type 3 micro-expressions may be completely eclipsed by a falsified expression. In this paper, we focus on type 1 micro-expressions, i.e., micro-expressions that correspond to rapid, but observable and non-suppressed motion on the face.

3. Optical strain calculation

There are two main approaches for calculating optical strain [2]: (i) integrate the strain definition into the optical flow equations, or (ii) derive strain directly from the flow vectors. The first approach requires the calculation of high order derivatives and hence is sensitive to image noise. The second approach allows us to post-process the flow vectors before calculating strain, possibly reducing the effect of any errors incurred during the optical flow estimation.

3.1. Optical flow

Optical flow is a well-known motion estimation technique that is based on the brightness conservation principle [27]. In general, it assumes (i) constant intensity at each point over a pair (sequence) of frames, and (ii) smooth pixel displacement within a small image region. It is typically represented by the following equation:

\[ (\nabla I)^T \mathbf{p} + I_t = 0 \quad (1) \]

where I(x, y, t) represents the image intensity at point (x, y) at time t, ∇I represents its spatial gradient, and I_t its temporal derivative. The horizontal and vertical motion vectors are represented by p = [p, q]^T, with p = dx/dt and q = dy/dt.

Since large intervals over a single expression can often cause failure in tracking (due to the smoothness constraint), we implemented a vector linking (or stitching) process that combines flows computed over small, local intervals (1–3 frames) into larger intervals that expand over the entire sequence of frames. This works by matching the optical flow from one pair of frames p(F_n, F_{n+1}) to a consecutive pair of frames p(F_{n+1}, F_{n+2}) and summing the (p, q) components from each to generate the larger displacement p(F_n, F_{n+2}) (see Fig. 3).

Fig. 3. Vector linking (stitching).
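As an illustration of this linking step, a minimal Python sketch is given below. It assumes dense per-pair flow fields are already available; OpenCV's Farnebäck routine is used only as a readily available stand-in, since the paper itself relies on a MATLAB Horn–Schunck implementation, and the parameter values shown are not those of the paper.

```python
import cv2
import numpy as np

def pairwise_flows(gray_frames):
    """Dense optical flow between consecutive grayscale frames.
    Farneback is a stand-in for the Horn-Schunck method used in the paper."""
    flows = []
    for f0, f1 in zip(gray_frames[:-1], gray_frames[1:]):
        # args: prev, next, flow, pyr_scale, levels, winsize, iters, poly_n, poly_sigma, flags
        flows.append(cv2.calcOpticalFlowFarneback(f0, f1, None,
                                                  0.5, 3, 15, 3, 5, 1.2, 0))
    return flows

def stitch_flows(flows):
    """Vector linking: sum the (p, q) components of consecutive short-interval
    flows to approximate the displacement over the whole interval (Fig. 3)."""
    return np.sum(np.stack(flows, axis=0), axis=0)
```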

3.2. Optical strain

The projected 2-D displacement of any deformable object can be expressed by a vector u = [u, v]^T. If the motion is small enough, then the corresponding finite strain tensor is defined as [2,4]:

\[ \varepsilon = \frac{1}{2}\left[ \nabla \mathbf{u} + (\nabla \mathbf{u})^T \right], \quad (2) \]

which can be expanded to the form:

\[ \varepsilon = \begin{bmatrix} \varepsilon_{xx} = \dfrac{\partial u}{\partial x} & \varepsilon_{xy} = \dfrac{1}{2}\left(\dfrac{\partial u}{\partial y} + \dfrac{\partial v}{\partial x}\right) \\ \varepsilon_{yx} = \dfrac{1}{2}\left(\dfrac{\partial v}{\partial x} + \dfrac{\partial u}{\partial y}\right) & \varepsilon_{yy} = \dfrac{\partial v}{\partial y} \end{bmatrix} \quad (3) \]

where (ε_xx, ε_yy) are normal strain components and (ε_xy, ε_yx) are shear strain components.

Since each of these strain components is a function of the displacement vector (u, v) over a continuous space, each strain component is approximated using the discrete optical flow data (p, q):

\[ p = \frac{\delta x}{\delta t} = \frac{\Delta x}{\Delta t} = \frac{u}{\Delta t}, \quad u = p\,\Delta t, \quad (4) \]

\[ q = \frac{\delta y}{\delta t} = \frac{\Delta y}{\Delta t} = \frac{v}{\Delta t}, \quad v = q\,\Delta t \quad (5) \]

where Δt is the change in time between two image frames. Setting Δt to a fixed interval length (Δt = 1 in our experiments, in order to maximize sensitivity to micro-expressions, which can last as little as 2–3 frames), we can estimate the partial derivatives of Eqs. (4) and (5):

\[ \frac{\partial u}{\partial x} = \frac{\partial p}{\partial x}\Delta t, \quad \frac{\partial u}{\partial y} = \frac{\partial p}{\partial y}\Delta t, \quad (6) \]

\[ \frac{\partial v}{\partial x} = \frac{\partial q}{\partial x}\Delta t, \quad \frac{\partial v}{\partial y} = \frac{\partial q}{\partial y}\Delta t. \quad (7) \]

The second order derivatives are calculated using the central difference method. Hence,

\[ \frac{\partial u}{\partial x} = \frac{u(x+\Delta x) - u(x-\Delta x)}{2\Delta x} = \frac{p(x+\Delta x) - p(x-\Delta x)}{2\Delta x} \quad (8) \]

\[ \frac{\partial v}{\partial y} = \frac{v(y+\Delta y) - v(y-\Delta y)}{2\Delta y} = \frac{q(y+\Delta y) - q(y-\Delta y)}{2\Delta y} \quad (9) \]

where (Δx, Δy) is 1 pixel. Finally, each of these values corresponding to low and large elastic moduli are summed to generate the strain magnitude. Each value can also be normalized to 0–255 for a visual representation (strain map). Fig. 4 shows the visualization of the strain values (strain pattern) obtained during both a macro- and micro-expression.

Fig. 4. (a) Three frames from a smile expression. Eyes are masked for privacy concerns. Column (b) contains the corresponding strain maps.

Fig. 5. Flow diagram of the automatic facial expression spotting algorithm.


Fig. 6. Facial alignment and segmentation. In (a) the sixty-six points automatically found by [28] are used to segment out the boundary of the face, eyes, and mouth shown in (b). The face is then segmented into the four regions shown in (c).


4. Algorithm

The algorithm for spotting both macro- and micro-expressions can be seen in Fig. 5. It consists of several steps, each of which will now be described.

4.1. Face tracking and registration

Facial tracking was performed using the subspace constrained mean shift algorithm [28], which tracks several points on the face over a video sequence. These points are then used for two purposes: i) we are able to align faces in the video sequence by transforming each image to match the original location of the face, hence reducing the amount of motion between consecutive frames (such as a person rigidly moving his/her head back and forth); ii) we can then use these points for segmenting and masking the face (in the masking step). See Fig. 6(a) for an example of a face with the tracked points.
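As a rough illustration of the alignment in step (i), the sketch below warps a frame so that its tracked landmarks line up with those of a reference frame. The 66 landmark points are assumed to come from the tracker of [28]; using a partial affine (similarity) fit is our assumption, since the paper only states that each image is transformed to match the original face location.

```python
import cv2
import numpy as np

def align_to_reference(frame, points, ref_points):
    """Warp `frame` so that its tracked landmarks `points` (66 x 2) line up
    with `ref_points` from the first frame, removing rigid head motion."""
    M, _ = cv2.estimateAffinePartial2D(np.float32(points),
                                       np.float32(ref_points))
    h, w = frame.shape[:2]
    return cv2.warpAffine(frame, M, (w, h))
```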

4.2. Crop image, optical flow and strain

Optical flow calculation can be computationally expensive, so it is beneficial to reduce the image size before estimating motion. To do this, we crop the face from the image using the 66 points tracked on the face. Since all future images are aligned to this coordinate system, all future face images will have the same dimensions. Optical flow is estimated using the MATLAB implementation of the Horn–Schunck method. Next, optical strain is estimated using the central difference method by convolving two 3 × 3 Sobel kernels (S_x, S_y) over each of the (p, q) flow fields to generate the four components of the strain tensor ε = ½[∇p + (∇p)^T] (see Section 3.2).

Fig. 7. Example strain signal. In (a) the values for the summed strain magnitude are given for roughly 180 frames in an example video sequence. Part (b) shows the normalized strain magnitude after subtracting the blue line (second degree least squares fit) from (a).

4.3. Masking

By localizing several features of the face in the tracking stage, we are able to segment the reliable strain values on the face from those that are noise. Noisy values are due to inaccurate flow computations caused by the violation of the smoothness constraints and self-occlusions. Occlusions can be found at the boundary of the face, where the background is lost due to the rigid head motion. The mouth is problematic because opening/closing the mouth are rigid motions that contain self-occlusions, thus causing tracking failures. The eyes are also masked because they do not contain non-rigid motion and blinking can cause noisy motion estimations.
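A masking step along these lines might be sketched as follows, assuming polygons for the face outline, eyes, and mouth have been extracted from the tracked landmarks (the exact landmark indices and polygon construction are assumptions):

```python
import cv2
import numpy as np

def build_face_mask(shape, face_poly, eye_polys, mouth_poly):
    """Boolean mask keeping the face interior while zeroing out the eye and
    mouth regions, where flow (and hence strain) estimates are unreliable.

    `shape` is (H, W); each polygon is an (N, 2) array of pixel coordinates.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.fillPoly(mask, [np.int32(face_poly)], 255)    # keep face interior
    for eye in eye_polys:
        cv2.fillPoly(mask, [np.int32(eye)], 0)        # drop eye regions
    cv2.fillPoly(mask, [np.int32(mouth_poly)], 0)     # drop mouth region
    return mask > 0
```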

4.4. Strain magnitude and pre-processing

The strain magnitude S_R is calculated by summing all values generated separately for each region of the face. Each region then generates a sequence corresponding to the amount of strain observed over time. These values are processed using two passes, the first of which spots macro-expressions and the second micro-expressions.

Before a peak detector is used to find the points of maximum strain,the values are first pre-processed using the following steps:

• Fit a 2nd degree polynomial to the sequence of total strain magnitude using the least squares method.
• At each point, subtract this curve from the sequence.
• Perform Gaussian smoothing on the entire sequence.
• Normalize the sequence between [0,1] using min–max normalization.

The first step minimizes error in optical flow that is accumulated when stitching the flow values over the entire video sequence. The fitted polynomial curve approximates the mean value at each frame over time (see the blue line in Fig. 7), and subtracting it removes this accumulated trend. This step could also use an adaptive mean filter at each frame if the user wanted to run this on live video streams with an unknown duration.



Fig. 8. Some example spotted expressions are shown in (a)–(e) from the strain signal given in (f) and (g). Spotted macro-expressions are denoted by an 'o', and spotted micro-expressions are denoted by an 'x'. The number given above each peak in (f) is the number of frames in the expression, also denoted by each solid line.

Fig. 9. Example spotted expression using peak detection and boundary detection.


Next, to remove false positives due to noisy spikes in the strain calculation, all values are smoothed using a Gaussian filter. Lastly, min–max normalization allows us to define the search space for the parameters needed when detecting the peaks which correspond to macro-expressions. Part (b) of Fig. 7 shows the final normalized strain signal.
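A compact sketch of these pre-processing steps follows; the Gaussian smoothing width sigma is an assumed parameter, since its value is not reported in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_strain(signal, sigma=2.0):
    """Detrend with a 2nd degree least-squares polynomial fit, smooth with a
    Gaussian filter, and min-max normalize the strain sequence to [0, 1]."""
    s = np.asarray(signal, dtype=float)
    x = np.arange(len(s))
    trend = np.polyval(np.polyfit(x, s, 2), x)   # 2nd degree LS fit (blue line in Fig. 7)
    s = s - trend                                # remove accumulated drift
    s = gaussian_filter1d(s, sigma)              # suppress noisy spikes
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```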

4.5. Spot macro-expressions

Since macro-expressions can occur over multiple regions of the face simultaneously, all strain values contained in all regions are summed to generate an overall strain magnitude (i.e., the mask in Fig. 6(b) is used).

Next, peaks corresponding to macro-expressions are found. Refer to Fig. 9 for an illustration. The peak detection [5] uses a parameter α, which is a threshold on the minimal strain magnitude allowed to be a peak, and β, which determines the amount that the peak must be above the surrounding areas. First, for each value S_i ∈ S (where S is the set of all strain magnitudes for all frames), its forward derivative is taken, or

Page 7: Automatic expression spotting in videos

Fig. 10. ROC of macro-expression spotting on USF-combination dataset as β is varied between (.01,.9). AUC = .93.

Table 1
EXP-Spontaneous dataset statistics.

                                  Subject 1    Subject 2    Total
# of frames                       116,622      118,954      120,165
% of frames                       97           98
# of exp.                         68           104          172
% of exp.                         39           60
ave. length of exp. (frames)      127          112          118
ave. length between exp.          1653         1057         1355



\[ S'_i = \left( S_i - S_{i+1} \right), \quad i = 1 \ldots |S| - 1 \quad (10) \]

Next, all extrema (potential peaks and valleys) are indexed in E where sign changes occur, or

\[ k \in E \ \text{if} \ S'_k \cdot S'_{k+1} < 0, \quad \forall S'_k \in S'. \quad (11) \]

Then, ∀j ∈ E, if

\[ S_j - S_{j-1} > \beta \ \text{and} \ S_j > S_{j+1} \quad (12) \]

or

\[ S_j > S_{j-1} \ \text{and} \ S_j - S_{j+1} > \beta \quad (13) \]

and

\[ S_j > \alpha \quad (14) \]

then S_j is a peak corresponding to a macro-expression.

Lastly, S_{j+1} marks the potential end of the expression while the potential beginning of the expression is given by S_{j−1}. The final boundary is determined when the strain magnitude matches at both the beginning and end of the expression, for the larger of the two strain magnitudes. Hence, given E_max = max(S_{j−1}, S_{j+1}), the boundary is determined at the points on each side of the peak that are approximately equal:

\[ S_l \approx S_r \approx T \cdot \left( S_j - \max\left( S_{j-1}, S_{j+1} \right) \right) \quad (15) \]

where T = .1 (i.e., at 10% of the peak height) and (S_l, S_r) are the nearest points found in S, with the corresponding frame numbers (l, r) used to denote the expression beginning and end. Note that if the peak is located near the boundary, this value will be the boundary itself if no intersection is found. The last check is to ensure the expression is long enough to be a macro-expression, i.e., the expression boundary must be greater than 1/3rd of a second, or roughly 10 frames in duration. See Fig. 8 for spotted macro-expressions on an example sequence.

Fig. 11. ROC of micro-expression spotting on USF-combination dataset as β is varied between (.01,.9). AUC = .79.
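For concreteness, a sketch of the peak detection and boundary search of Eqs. (10)–(15) is given below. Reading S_{j−1} and S_{j+1} as the neighbouring extrema, and the default parameter values, are assumptions made for illustration rather than the settings tuned in the paper.

```python
import numpy as np

def spot_macro_peaks(s, alpha=0.3, beta=0.1, T=0.1, min_len=10):
    """Spot macro-expression intervals in a normalized strain signal `s`."""
    s = np.asarray(s, dtype=float)
    d = s[:-1] - s[1:]                               # forward difference, Eq. (10)
    ext = np.where(d[:-1] * d[1:] < 0)[0] + 1        # extrema via sign changes, Eq. (11)
    spans = []
    for i, j in enumerate(ext):
        prev_e = s[ext[i - 1]] if i > 0 else s[0]             # neighbouring extrema
        next_e = s[ext[i + 1]] if i + 1 < len(ext) else s[-1]
        is_peak = ((s[j] - prev_e > beta and s[j] > next_e) or
                   (s[j] > prev_e and s[j] - next_e > beta))  # Eqs. (12)-(13)
        if not (is_peak and s[j] > alpha):                    # Eq. (14)
            continue
        level = T * (s[j] - max(prev_e, next_e))              # boundary level, Eq. (15)
        l, r = j, j
        while l > 0 and s[l] > level:                 # walk left to the boundary
            l -= 1
        while r < len(s) - 1 and s[r] > level:        # walk right to the boundary
            r += 1
        if r - l >= min_len:                          # roughly 1/3 s minimum duration
            spans.append((l, r))
    return spans

# Example (hypothetical): spans = spot_macro_peaks(preprocess_strain(region_strain_sums))
```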

4.6. Spot micro-expressions

After the intervals for macro-expressions are found, the subintervals are then searched for any micro-expressions. The thresholding technique is nearly identical to that of macro-expressions. First, there is a constraint on locality, hence micro-expressions are restricted to occurring in at most two bordering regions (see Fig. 6) of the face (I–II, II–III, III–IV, I–IV). Second, micro-expressions are very rapid (lasting as few as 3 frames) and often occur in just one region of the face. Hence, to ensure that the width of the peak is consistent (and not a noisy spike in flow error), we use a threshold value of T = .5 (Eq. (15)), or half the peak height. See Fig. 8 for spotted micro-expressions on an example sequence.

4.7. Restarting

For segmenting very long videos, it is necessary to restart the optical flow calculation. This is an important step primarily because error in the flow linking and estimation can accumulate over many frames. Based on visual inspection of the optical flow fields in our experiments, we found that after 30 s of video the flow error is too noisy for accurate strain estimation. Because the optical flow needs to be calculated from the beginning of an expression, we restart the algorithm at the end of the last found expression. If there have been no expressions found in the last N frames, we restart the tracking at the current frame.

5. Experiment

Experiments were performed on several datasets. The ground-truth labels were hand-marked by an analyst who specified the beginning and ending frame of each expression. Before ground-truthing, the analyst was first shown several examples of macro- and micro-expressions.

Table 2
Average error for finding the length of spotted macro- and micro-expressions on the USF-combination dataset.

Type                      Mean error    Std. deviation
USF-combination macro     10 frames     ±10 frames
USF-combination micro     4 frames      ±2 frames


Table 3
Average error for finding the start frame for macro- and micro-expression spotting on the USF-combination dataset.

Type                      Mean error    Std. deviation
USF-combination macro     5 frames      ±8 frames
USF-combination micro     4 frames      ±2 frames


We now discuss each dataset, and then report the results of macro- and micro-expression spotting.

Fig. 13. Results for experiment using the same dataset as [9]. The experiment consists of 154 example sequences, half of which contain micro-expressions. AUC = 0.7849.

5.1. Datasets

5.1.1. USF-combination
This is a collection of 10 videos that contain 75 feigned macro-expressions and 37 feigned micro-expressions interspersed within each video sequence. Each video was recorded using a Panasonic AG-HMC40 camcorder at a resolution of 720 × 1280, with an average video length of 1 min.

5.1.2. SMIC micro
This dataset [9] is publicly available and includes 77 sequences of micro-expressions, with one micro-expression per sequence. The faces are cropped in these images at an approximate resolution of 140 × 175.

5.1.3. USF-Spontaneous
The EXP-Spontaneous dataset was collected from a 66 minute HD (1920 × 1080 pixel) video clip from a Panasonic HD camera. The video contains two subjects playing a video game on a large monitor. Both subjects were present and frontward facing in the video for 97% of the recording time. The camera was positioned between the subjects and the television, aimed and centered between the subjects. The dataset contains 68 expressions by Subject 1 and 104 expressions by Subject 2. The average expression duration reported for Subject 1 is 127 frames, which is approx. 4.2 s. The average expression duration reported for Subject 2 is 112 frames, which is approx. 3.7 s. The average time between expressions reported for Subject 1 is 1653 frames, which is approx. 55 s. The average time between expressions reported for Subject 2 is 1057 frames, or approx. 35 s. Talking and head-turned-away-from-camera movements (don't care regions) account for .9% of the video; they were labeled in the ground truth and were not considered as expressions.

Fig. 12. ROC of micro-expression spotting on SMIC dataset as β is varied between (.01,.9). AUC = 0.83.

Fig. 14. ROC of macro- and micro-expression spotting in videos at different scales (100%, 50%, and 25%) as β is varied between (.01,.9). The resolutions given are the approximate face size of the subject at these scales. The AUC for each scale, respectively, is .93, .95, and .93 for macro-expression spotting and .79, .79, and .74 for micro-expression spotting.


Fig. 15. ROC for spotting on EXP-Spontaneous expression dataset as β varies from [.01,.25]. For Subject 1 AUC = .87, and for Subject 2 AUC = .75.

Table 5
Average error for detected length of expression for USF-Spontaneous.

Subject      Mean error    Std. deviation
Subject 1    42 frames     ±20 frames
Subject 2    46 frames     ±31 frames


5.1.4. DISFA spontaneous facial action database
This is a publicly available dataset that consists of 27 adult subjects (12 female and 15 male). The subjects were recorded with a stereo camera (the left camera is used in our experiments) while watching a four-minute video that invoked expressions. Per-frame annotations for 14 action units are provided, each with an intensity measure (0–5). Each action unit corresponds to a type of motion on the face, for example, "inner brow raiser", "outer brow raiser", "lip corner puller", etc. Since we do not detect distinct types of expressions, we used these annotations to generate a generalized set of expressions. A macro-expression is defined by the activation of one or more action units for more than 10 frames. It is worth noting that in the annotation, a single action unit may be activated for hundreds, and sometimes thousands, of frames (for instance, if the person is concentrating on the video, they may slightly furrow their eyebrows for a long period of time). Often, within the intervals of these long expressions, there are several sub-expressions that have higher intensities than the larger expression (this common occurrence is also reported in [12]). Therefore, in order to properly represent these sub-expressions in the ground truth, we generate and report results on several expression ground-truth labels for each sequence, based on thresholding the intensity measures at two points (T = 1, T = 3), i.e., in order to be considered an expression, the action unit must have an intensity value greater than or equal to T. Overall, when thresholding at these two values, we obtain two sets of ground truth that contain 409 and 259 expressions, respectively. It is worth noting that generating a set of micro-expressions is not as clear, and using a strict definition such as "exactly one action unit active for less than 10 frames" only generates a few examples in the dataset.

5.2. Results

Results will now be given for each dataset. The performance is shown using an ROC comparing the true positive rate (TPR) and the false positive rate, by varying the peak magnitude threshold β. We also report the area under the curve (AUC) for each ROC plot.

Table 4
Average error for spotting start of expression for USF-Spontaneous.

Subject      Mean error    Std. deviation
Subject 1    15 frames     ±11 frames
Subject 2    31 frames     ±7 frames

The true positive rate was calculated as the number of successfully detected expressions out of the total number of expressions, or

\[ \mathrm{TPR} = \frac{N_{\mathrm{detectedExp}}}{N_{\mathrm{totalExp}}} \quad (16) \]

while the false positive rate,

\[ \mathrm{FPR} = \frac{N_{\mathrm{detectedExpFrames}}}{N_{\mathrm{Frames}} - N_{\mathrm{totalExpFrames}}} \quad (17) \]

where, for each subject, N_detectedExp and N_detectedExpFrames are the number of correctly detected expressions and the summed total number of frames they contain, and similarly, N_totalExp and N_totalExpFrames are the number of ground-truth expressions and the frames they contain. Lastly, N_Frames is the total number of frames in the dataset. For estimating the F1 score in Section 6, we define a false negative as

\[ \mathrm{FN} = N_{\mathrm{totalExp}} - N_{\mathrm{detectedExp}}. \quad (18) \]
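These evaluation measures transcribe directly into code; the short function below simply mirrors Eqs. (16)–(18) using the counts defined above.

```python
def spotting_rates(n_detected_exp, n_total_exp,
                   n_detected_exp_frames, n_total_exp_frames, n_frames):
    """True positive rate, false positive rate, and false negative count
    computed exactly as in Eqs. (16)-(18)."""
    tpr = n_detected_exp / n_total_exp                              # Eq. (16)
    fpr = n_detected_exp_frames / (n_frames - n_total_exp_frames)   # Eq. (17)
    fn = n_total_exp - n_detected_exp                               # Eq. (18)
    return tpr, fpr, fn
```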

We now report the results of finding macro-expressions in the mixed dataset USF-combination in Fig. 10. The ROC generated by varying β is promising, achieving 81% TPR with less than a .1% false positive rate, and at its peak successfully spots 94% of the macro-expressions with less than a .4% FPR. Alternatively, at these two data points, this translates to 58 macro-expressions being spotted successfully at the expense of 4 false positives, and 68 macro-expressions being successfully spotted with 14 false positives. False positives can be mainly attributed to the subject moving his/her face too rapidly, resulting in failure in optical flow and hence strain estimation. On this same dataset, the ROC of micro-expression spotting is given in Fig. 11. While micro-expression spotting was less precise than macro-expression spotting, most of the micro-expressions were found. For example, the algorithm is able to successfully spot 18 (roughly 50%) of the micro-expressions with 4 false positives, and at most 29 (78%) micro-expressions at a cost of 21 false positives. We also report the precision of locating the start of the expression, as well as the expression length, in Tables 2 and 3 (see Table 1 for the statistics of the corresponding ground-truth intervals).

We also report the results on 77 sequences of micro-expressions from the SMIC-micro dataset in Fig. 12. It is worth noting that due to the lower resolution of the faces, we used the facial points included with this dataset to generate the mask rather than using mean-shift tracking.

Fig. 16. ROC for spotting on DISFA-Spontaneous expression dataset as β varies from [.01,.5].


Table 6
Average error for detected start and length of expression for DISFA-Spontaneous (T = 1).

Measure     Mean error    Std. deviation
Start       70 frames     ±58 frames
Length      184 frames    ±144 frames

Table 7
Average error for detected start and length of expression for DISFA-Spontaneous (T = 3).

Measure     Mean error    Std. deviation
Start       41 frames     ±38 frames
Length      88 frames     ±77 frames


Overall, the algorithm performed well on this dataset, with 56 (76%) of the micro-expressions detected with 10 false positives (.5% FPR); at its peak we successfully detect 64 (83%) of the expressions, at the expense of 19 (.9%) false positives. For direct comparison to the results reported by [9] and [18], we used all 77 samples containing an expression along with another 77 samples that did not. However, we test on all examples simultaneously since we do not rely on training. In total, 154 sequences were classified as either containing an expression or not. The best result reported by [9] was obtained using Multiple Kernel Learning (MKL) at a detection rate of 71.4%. The best result for [18] on this same dataset is 77.59%. We would like to stress that neither set of authors provides an FPR at this rate. At an FPR of 10%, we achieve a detection rate of 72% and reach a maximum TPR of 84% (see Fig. 13).

5.3. Face resolution

To observe the effect of resolution on the performance of our algorithm, we tested the USF-combination dataset at the following scales: 100%, 50%, and 25%. On average this directly corresponds to facial dimensions of roughly 300 × 310, 150 × 160, and 77 × 80. The results of this experiment are given in Fig. 14. A few things can be concluded from this experiment: (i) the algorithm is robust to a re-scaled face dimension of 150 × 160 pixels on our dataset, with no significant difference in accuracy when detecting macro-expressions and micro-expressions. In fact, at some points in the ROC curve, using a lower resolution increased accuracy at the same FPR (mainly because of smoother/less noisy flow fields); (ii) resolution is more important when detecting micro-expressions, since an immediate overall reduction in TPR can be observed at the 25% scale (77 × 80 pixels).

5.4. Results on hour-long video that contains spontaneous facial expressions

We now report results on the USF Spontaneous Expression dataset (see Fig. 15). It is worth noting that although the FPR is significantly higher than on the feigned expression database, spotting spontaneous facial expressions is a much more difficult problem. The large number of false positives can be primarily attributed to the large variations and rapid head motions from each subject. We also observed that the intensity of the expressions was much lower than when they are feigned.

Table 8
Area under the curve (AUC) performance for all datasets.

Dataset                     Spotting type               AUC scale = 1    AUC scale = .5    AUC scale = .25
USF-combination             Macro spotting              0.937            0.9899            0.9506
USF-combination             Micro spotting              0.7976           0.7998            0.7404
SMIC                        Micro spotting              0.8292           N/A               N/A
SMIC (comparison)           Micro spotting              0.7849           N/A               N/A
USF-spontaneous             Subject 1 macro spotting    0.8792           N/A               N/A
USF-spontaneous             Subject 2 macro spotting    0.7542           N/A               N/A
DISFA-spontaneous, T = 1    Macro spotting              0.8997           N/A               N/A
DISFA-spontaneous, T = 3    Macro spotting              0.8734           N/A               N/A

We also tracked the mean and standard deviation of percent error for finding the start and length of all expressions (see Tables 4 and 5). In general, the results are promising. Approximately 65% and 70% of the expressions are found on average for subjects 1 and 2, respectively. Moreover, the start of the expression is typically found within 9 frames.

5.5. Results on DISFA database

Finally, we give results for the DISFA dataset that consists of spontaneous macro-expressions (see Fig. 16). Overall, the performance is positive; however, in the case of T = 1 (see Section 5.1.4 for an explanation of T) there is a significant loss in performance for expression localization (see Tables 6 and 7). This is mainly due to the nature of the expressions found when thresholding the action unit intensities at this point, which results in very long expression lengths that often last up to 1000 frames. Since these longer expressions often contain many sub-expressions, their detections are considered successful; however, they are penalized with the bounds of the greater expression interval, leading to larger localization error. By contrast, when setting T = 3 and only detecting expressions with larger intensities, we achieve much better performance with respect to localization, which is more consistent with the results achieved on the USF-Spontaneous dataset.

6. Performance summary and selection of beta parameter

To summarize the performance of the method, we provide the AUC for all experiments in Table 8. Overall, the results are positive, and demonstrate that optical strain is a viable feature for capturing both rapid micro-expressions and the larger macro-expressions. The algorithm's robustness to resolution is reflected in the USF-combination experiments, where scaled-down facial resolutions performed equally well or only slightly worse than full HD resolution for macro-expression spotting. Results on the SMIC dataset are promising. A decrease in overall performance can be seen for spontaneous expressions, although on average for both subjects, the AUC remains above 80%.

While we have tied performance to a choice of a single threshold β, a potential concern with the algorithm is making a good selection of this parameter. In general, a higher choice of β decreases the sensitivity of the approach, requiring large amounts of motion/deformation to be present before an expression is detected. In contrast, a lower choice of β increases the sensitivity.

In Table 9, we provide the F1 scores at several selections of β. The F1 score is a measure of accuracy that combines precision and recall (their harmonic mean). A few conclusions may be drawn from this table. First, on the USF-combination dataset, an optimal choice of β is less than or equal to .3 for both macro- and micro-expression spotting. For macro-expressions in particular, a peak size of .1 tends to be optimal, and deviating to nearby values can cause as much as a 9–14% decrease in accuracy. For micro-expressions on this dataset, the choice does not appear to be as constrained, as accuracy deviates only 1–3% in this same range. However, for the high-speed spontaneous SMIC dataset, it is critical to choose the highest sensitivity, or lowest β = .01. For the USF-Spontaneous dataset, a good choice of β is greater than or equal to .5. This implies that the amount of motion on the face in general is greater.



Table 9
F1 scores at varying choices of β for all experiments.

Dataset                     Spotting type               β = 0.01    β = 0.1    β = 0.3    β = 0.5    β = 0.7    β = 0.9
USF-combination             Macro spotting              0.7753      0.8684     0.7273     0.5794     0.38       0.0881
USF-combination             Micro spotting              0.6304      0.644      0.6747     0.5455     0.0512     0.0512
SMIC                        Micro spotting              0.8467      0.7966     0.2162     0.0303     0.0303     0.0303
SMIC (comparison)           Micro spotting              0.7         0.8696     0.3056     0.0322     0.0322     0.0322
USF-spontaneous             Subject 1 macro spotting    0.054       0.1414     0.3164     0.4737     0.4        0.4132
USF-spontaneous             Subject 2 macro spotting    0.08        0.2149     0.1641     0.2857     0.4348     0.4972
DISFA-spontaneous, T = 1    Macro spotting              0.8178      0.5292     0.2837     0.1544     0.0956     0.0370
DISFA-spontaneous, T = 3    Macro spotting              0.7863      0.5114     0.2321     0.1367     0.1164     0.0526


We have also observed in this dataset that subjects often move their head during expressions, or move their head in general, leading to an increased optical strain magnitude signal. The same is true for the DISFA dataset. Moreover, many subjects in this dataset have very long expressions with low action unit intensities. In fact, 12 out of 27 subjects had expressions lasting longer than 600 frames, with each expression containing multiple sub-expressions. This led to large localization error when thresholding the action unit intensities at T = 1. When thresholding at T = 3, the smaller expressions were localized with greater precision.

7. Conclusions

In this paper, we proposed a method for the automatic spotting of facial expressions in videos comprised of numerous facial expressions, without the need to train a model. The method relies on the increases and decreases in the magnitude of the strain observed on the facial skin as a subject performs an expression. This approach is able to successfully detect and distinguish between regular, universal macro-expressions and rapid micro-expressions. Results are positive on several datasets, including an hour-long video that contains spontaneous expressions. The method has also been shown to work at several different resolutions, with face sizes of approximately 300 × 310, 150 × 160, and 77 × 80. A few points can be concluded from this work. First, to the knowledge of the authors, this is the first work that addresses the problem of automatically segmenting videos into sequences that contain a single macro- or micro-expression. Moreover, we are not aware of any method that can segment an unknown number of expression types. Second, we find that detecting spontaneous expressions is a much more difficult problem than feigned expression spotting, mainly because of two factors occurring simultaneously: (1) spontaneous expressions are often obtained while attempting to minimize the subject's knowledge of or fixation on the camera. Hence large and semi-random rigid head motion is often present, which is typically missing in ideal laboratory controlled settings, where often a subject's pose is meant to be stationary; (2) based on our observation, the expressions themselves are much more subdued when spontaneous. This observation is also supported in [29], where the spontaneous smile expression was of lower amplitude than its feigned counterpart. Future work includes addressing the subjectivity of ground-truth labeling in the spontaneous dataset. We will also be providing USF-Spontaneous, along with the full annotation, to the community.

References

[1] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, S. Sarkar, Towards macro- and micro-expressions spotting in videos using strain patterns, Workshop on Applications of Computer Vision, Dec. 2009.

[2] M. Shreve, S. Godavarthy, D. Goldgof, S. Sarkar, Macro- and micro-expression spotting in long videos using spatio-temporal strain, International Conference on Automatic Face and Gesture Recognition, Mar. 2011.

[3] S. Godavarthy, Micro-Expression Spotting in Video Using Optical Strain, (Master's Thesis) University of South Florida, 2010.

[4] M. Shreve, N. Jain, D. Goldgof, S. Sarkar, W. Kropatsch, C.-H.J. Tzou, M. Frey, Evaluation of facial reconstructive surgery on patients with facial palsy using optical strain, Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns — Volume Part I, ser. CAIP'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 512–519.

[5] N. Yoder, D. Adams, M. Triplett, Multidimensional sensing for impact load and damage evaluation in a carbon filament wound canister, Journal on Materials Evaluation, vol. 66, no. 7, Jan. 2008, pp. 756–763.

[6] Z. Zeng, Y. Fu, G.I. Roisman, Z. Wen, Y. Hu, T.S. Huang, Spontaneous emotional facial expression detection, Journal of Multimedia, vol. 1, 2006, pp. 1–8.

[7] N. Sebe, M. Lew, Y. Sun, I. Cohen, T. Gevers, T. Huang, Authentic facial expression analysis, Image Vis. Comput. 25 (12) (2007) 1856–1863.

[8] Y. Tong, J. Chen, Q. Ji, A unified probabilistic framework for spontaneous facial action modeling and understanding, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2) (2010) 258–273.

[9] T. Pfister, X. Li, G. Zhao, M. Pietikäinen, Recognising spontaneous facial micro-expressions, Proceedings of the International Conference on Computer Vision, 2011.

[10] S. Polikovsky, Y. Kameda, Y. Ohta, Facial micro-expression detection in hi-speed video based on facial action coding system (FACS), IEICE Trans. 96-D (1) (2013) 81–92.

[11] Q. Wu, X. Shen, X. Fu, The machine knows what you are hiding: an automatic micro-expression recognition system, Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction — Volume Part II, 2011, pp. 152–162.

[12] P. Ekman, E.T. Rolls, D.I. Perrett, H.D. Ellis, Facial expressions of emotion: an old controversy and new findings [and discussion], Philosophical Transactions: Biological Sciences, vol. 335, 1992, pp. 63–69.

[13] P. Ekman, Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage (Revised and Updated Edition), W.W. Norton and Company, 2001.

[14] J. Sung, D. Kim, Real-time facial expression recognition using STAAM and layered GDA classifier, Image Vis. Comput. 27 (9) (2009) 1313–1325.

[15] G. Littlewort, M.S. Bartlett, I. Fasel, J. Susskind, J. Movellan, Dynamics of facial expression extracted automatically from video, Image Vis. Comput. 24 (6) (2006) 615–625.

[16] S. Chen, Y. Tian, Q. Liu, D.N. Metaxas, Recognizing expressions from face and body gesture by temporal normalized motion and appearance features, Image Vis. Comput. 31 (2) (2013) 175–185.

[17] C. Shan, S. Gong, P.W. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. 27 (6) (2009) 803–816.

[18] J.A. Ruiz-Hernandez, M. Pietikäinen, Encoding local binary patterns using the re-parametrization of the second order Gaussian jet, Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops, 2013, pp. 1–6.

[19] L. Jeni, J. Girard, J. Cohn, F. De la Torre, Continuous AU intensity estimation using localized, sparse facial feature space, Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops, April 2013, pp. 1–7.

[20] N. Zaker, M. Mahoor, W. Mattson, D. Messinger, J. Cohn, Intensity measurement of spontaneous facial actions: evaluation of different image representations, Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference, Nov. 2012, pp. 1–2.

[21] S.M. Mavadati, M.H. Mahoor, K. Bartlett, P. Trinh, J.F. Cohn, DISFA: a spontaneous facial action intensity database, IEEE Trans. Affect. Comput. 4 (2) (2013) 151–160.

[22] F. Zhou, F. De la Torre, J.F. Cohn, Unsupervised discovery of facial events, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[23] S. Liwicki, S. Zafeiriou, M. Pantic, Incremental slow feature analysis with indefinite kernel for online temporal video segmentation, Proceedings of the 11th Asian Conference on Computer Vision, ACCV 2012, Daejeon, Korea, ser. Lecture Notes in Computer Science, Springer Verlag, London, November 2012, pp. 162–176.

[24] M. Pantic, L. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, 2000, pp. 1424–1445.

[25] C. Padgett, G. Cottrell, Representing face images for emotion classification, Advances in Neural Information Processing Systems, 1996, pp. 894–900.

[26] M.J. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, Int. J. Comput. Vis. 25 (1) (1997) 23–48.

[27] M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Computer Vision and Image Understanding, vol. 63, no. 1, Elsevier Science Inc., New York, NY, USA, 1996, pp. 75–104.

[28] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, International Conference on Computer Vision, September 2009.

[29] J.F. Cohn, K.L. Schmidt, The timing of facial motion in posed and spontaneous smiles, J. Wavelets, Multi-resolution and Information Processing, vol. 2, 2004, pp. 1–12.