

A Facial Expression Recognition System Using Convolutional Networks

Andre Teixeira Lopes, Edilson de Aguiar, Thiago Oliveira-Santos*
Federal University of Espirito Santo, Brazil

*Corresponding Author: [email protected]

Fig. 1. This work proposes a simple solution for facial expression recognition that uses a combination of standard methods, such as Convolutional Networks and specific image pre-processing steps. To the best of our knowledge, our method achieves the best result in the literature, 97.81% accuracy, and takes less time to train than state-of-the-art methods.

Abstract—Facial expression recognition has been an active research area for the past ten years, with growing application areas such as avatar animation and neuromarketing. The recognition of facial expressions is not an easy problem for machine learning methods, since different people can vary in the way they show their expressions, and even images of the same person in one expression can vary in brightness, background and position. Therefore, facial expression recognition is still a challenging problem in computer vision. In this work, we propose a simple solution for facial expression recognition that uses a combination of standard methods, such as Convolutional Networks and specific image pre-processing steps. Convolutional networks, like most machine learning methods, achieve better accuracy when given a suitable feature set. Therefore, a study of some image pre-processing operations that extract only expression-specific features of a face image is also presented. The experiments were carried out using a widely used public database for this problem. A study of the impact of each image pre-processing operation on the accuracy rate is presented. To the best of our knowledge, our method achieves the best result in the literature, 97.81% accuracy, and takes less time to train than state-of-the-art methods.

Keywords-Facial Expression; Convolutional Networks; Computer Vision; Machine Learning; Expression Specific Features;

I. INTRODUCTION

Facial expression is one of the most important features for human emotion recognition [1]. It was introduced as a research field by Darwin in his book "The Expression of the Emotions in Man and Animals" [2]. According to Li and Jain [3], it can be defined as the facial changes in response to a person's internal emotional state, intentions, or social communications. Automated facial expression recognition has a large variety of applications nowadays, such as data-driven animation, neuromarketing, interactive games, entertainment, sociable robotics and many other human-computer interaction systems.

Expression recognition is a task that humans perform daily and effortlessly [3], but that is not yet easily performed by computers. A lot of work has tried to make computers reach the same accuracy as humans, and some examples of these works are highlighted here. Facial expression recognition systems can be divided into two main categories: those that work with static images [4], [5], [6] and those that work with dynamic image sequences [7], [8]. Static-based methods do not use temporal information, i.e. the feature vector comprises information about the current input image only. Sequence-based methods, on the other hand, use temporal information to recognize the expression based on one or more frames. Automated systems for facial expression recognition receive the expected input (static image or image sequence) and give as output one of the six basic expressions (anger, sadness, surprise, happiness, disgust and fear). Some systems also recognize the neutral expression. This work focuses on methods based on static images and considers the six basic expressions only.

As described in [3], automatic facial expression analysis comprises three steps: face acquisition, facial data extraction and representation, and facial expression recognition. Face acquisition can be separated into two major steps: face detection [9], [10], [11], [12] and head pose estimation [13], [14], [15]. After the face is located, the facial changes caused by facial expressions need to be extracted. These changes are usually represented by geometric feature-based methods [16], [17], [18], [12] or appearance-based methods [16], [6], [19].


Geometric feature-based methods work with the shape and location of facial components like the mouth, eyes, nose and eyebrows; the feature vector that represents the face geometry is composed of facial components or facial feature points. Appearance-based methods work with feature vectors extracted from the whole face, or from specific regions, acquired by applying image filters to the face image [3].

Facial expression recognition is the last stage of this system. According to Liu et al. [4], expression recognition systems basically use a three-stage training procedure: feature learning, feature selection and classifier construction, in this order. The feature learning stage is responsible for the extraction of all features related to the facial expression variation. The feature selection stage chooses the best features to represent the facial expression. They should minimize the intra-class variation of expressions while maximizing the inter-class variation [5]. Minimizing the intra-class variation is a problem because images of different individuals with the same expression are far from each other in pixel space, and maximizing the inter-class variation is also difficult because images of the same person in different expressions may be very close to each other in pixel space [20]. At the end of the process, one classifier (or several, one for each expression) is used to infer the facial expression, given the selected features.

Recently, a lot of work has been carried out in the facial expression recognition research field [4], [5], [12]. Methods using convolutional neural networks (CNN) for face recognition [21] can also be found in the literature. CNNs have a high computational cost in terms of memory and speed, but can achieve some degree of shift and deformation invariance and are also highly parallelizable. This network type has been shown to achieve high recognition rates in various image recognition tasks like character recognition [22], handwritten digit recognition [23], object recognition [24], and facial expression recognition [25], [26], [27], [7].

Although there are many methods proposed in the literature, some points still deserve attention: for example, the accuracy rate could be higher in [28], [1], the validation methods could be improved in [13], [28], [25], and other limitations in general. Trying to cope with some of these limitations while keeping a simple solution, we present an approach combining standard methods, like image normalizations, synthetic training sample generation (i.e. real images with artificial rotations) and a Convolutional Network, into a simple solution that is able to achieve a very high accuracy rate (97.81%), as can be seen in Fig. 1. Our training and experiments were carried out using the Extended Cohn-Kanade (CK+) database of static images [29], which is widely used in the literature. In addition, we present a complete validation method and a reduced training time compared with some of the state-of-the-art methods.

The remainder of this paper is organized as follows: the next section presents the most recent related works, followed by a description of the proposed approach. In Section IV, the experiments are explained and the results are presented and compared with the state of the art. Finally, a conclusion is presented.

II. RELATED WORK

There have been several expression recognition approaches developed in the last decade and a lot of progress has been made in this research area recently. A full survey can be found in [3], [30]. This section focuses on methods closely related to the approach proposed in this work.

In [4], the authors propose a novel approach called Boosted Deep Belief Network (BDBN). Their approach performs the three learning stages (feature learning, feature selection and classifier construction) iteratively in a unified framework. The BDBN focuses on the six basic expressions. Their experiments were conducted using two public databases of static images, Cohn-Kanade [29] and JAFFE [31], and achieved an accuracy of 96.7% and 68.0%, respectively. The training and the testing adopted a one-versus-all classification strategy, creating a binary classifier for each expression. The time required to train the network was about 8 days. The online recognition time is a function of the weak classifiers: in their method they use seven classifiers, one for each expression, and each classifier took 30 ms per expression, for a total online recognition time of about 0.21 s.

In [5], the authors perform a deep study using Local Binary Patterns (LBP) as the feature extractor. They compare and combine different machine learning techniques, like template matching, Support Vector Machine, Linear Discriminant Analysis and linear programming, to recognize facial expressions. The authors also conduct a study to analyze the impact of image resolution on the accuracy and conclude that methods based on geometric features do not handle low-resolution images very well, while those based on appearance, like Gabor Wavelets and LBP, are not as sensitive to image resolution. The best result achieved in their work was an accuracy rate of 95.1% using SVM and LBP on the Cohn-Kanade [29] database. The testing setting used a 10-fold cross-validation scheme. The training time and the online recognition time were not mentioned by the authors.

An approach using Histogram of Oriented Gradients (HoG) features is presented in [32]. The authors used a set of feature extractors like LBP, PCA and HoG with an SVM to classify static face images as one of the six expressions. The proposed method achieves an accuracy rate of 70% when classifying only five expressions (anger, fear, joy, relief and sadness). The training time and the online recognition time were not mentioned by the authors.

Convolutional neural networks (CNN) were first used by LeCun et al. [33], who were inspired by the early work of Hubel and Wiesel [34]. One of the main advantages of CNNs is that the model's input is a raw image rather than hand-coded features. CNNs are able to learn the set of features that best models the desired classification. In general, this type of hierarchical network has alternating types of layers: convolutional layers and sub-sampling layers. CNN architectures vary in how convolutional and sub-sampling layers are applied and how the networks are trained. Convolutional layers are mainly parametrized by the number of generated maps and the kernel size.


Fig. 2. Overview of the proposed facial expression recognition system. The system is divided into two main steps: training and testing. The training step takes as input an image with a face and its eye locations. Then, a spatial normalization is carried out to align the eyes with the horizontal axis. After that, new images are synthetically generated with rotation angles based on a Gaussian distribution to increase the database size. Then, a cropping is done using the inter-eye distance to remove background information, keeping only expression-specific features. A downsampling procedure is carried out to get the features of different images in the same location. Thereafter, an intensity normalization is applied to the image. The normalized images are used to train the Convolutional Network. The output of the training step is a set of weights that achieves the best result with the training data. The testing step uses the same methodology as the training step: spatial normalization, cropping, downsampling and intensity normalization. Its output is a single number that represents one of the six basic expressions.

The kernel is shifted over the valid region of the input image, generating one map. The learning procedure of CNNs consists of finding the best weights associated with the kernels. The learning can use a gradient descent method, like the one proposed in [33]. Sub-sampling layers are used to increase the position invariance of the kernels [35] by reducing the map size. The new reduced map is generated by applying a function to each pixel neighborhood of the higher resolution version. The main types of sub-sampling layers are maximum-pooling and average-pooling [35]. In maximum-pooling, the new map keeps only the maximum pixel value of the respective neighborhood region, whereas in average-pooling the new pixel value is an average of the neighbors.

In [27], a method that uses a combination of two techniques, convolutional networks to detect faces and a rule-based algorithm to recognize the expression, is presented. The rule-based algorithm is proposed to enhance subject-independent facial expression recognition. This procedure uses rules like the distance between the eyes and the mouth, the length of the horizontal line segment in the mouth and the length of the horizontal line segment in the eyebrows. The experiments were carried out with 10 persons only and the authors report the results of detecting smiling faces only (i.e. happiness), which was 97.6%. The training time and the online recognition time were not mentioned by the authors.

A video-based facial expression recognition system is proposed in [7]. They developed a 3D-CNN that takes an image sequence (from the neutral to the final expression) of 5 successive frames as a 3D input. Therefore, the CNN input is H × W × 5, where H and W are the image height and width respectively, and 5 is the number of frames. The authors claim that the 3D-CNN method can handle some degree of shift and deformation invariance. With this approach they achieved an accuracy of 95%, but the method relies on a sequence containing the full movement from the neutral to the final expression. The experiments were carried out with 10 persons only. The training time and the online recognition time were not mentioned by the authors.

III. FACIAL EXPRESSION RECOGNITION SYSTEM

In this section, an efficient method that performs the three learning stages in just one classifier (CNN) is presented. The only additional step required is a pre-processing stage to normalize the images. The proposed method comprises two main phases: training and testing. During training, the system receives a database of grayscale face images with their respective expressions and eye center locations, and learns a set of weights that best separates the facial expressions for classification. During testing, the classifier receives a grayscale image of a face along with the respective eye center locations, and outputs the predicted expression by using the weights learned during training.

An overview of the method is illustrated in Fig. 2. The training and the testing have slightly different workflows. For training, the system applies a sequence of: spatial normalization, synthetic sample generation, image cropping, downsampling and intensity normalization. For recognizing an unknown image (i.e. the testing phase), the system applies a sequence of: spatial normalization, image cropping, downsampling and intensity normalization. The only difference between the image pre-processing steps of training and testing is the synthetic sample generation, which is used in training only. The output of the trained CNN is a single label number that expresses one of the six basic expressions.


A. Spatial Normalization

The images in the database vary in rotation, brightness and size, even for images of the same person. These variations are not related to the facial expression and can affect the accuracy rate of the system. To address this problem, a spatial normalization of the face region is performed to correct geometric issues such as rotations. Basically, a rotation is applied to align the eyes with the horizontal axis of the image, i.e. the angle formed by the line segment going from one eye center to the other and the horizontal axis becomes zero. The spatial normalization procedure is shown in Fig. 3.

Fig. 3. Spatial normalization example. The non-corrected input image (left) is rotated (right) using the line segment going from one eye center to the other (red line) and the horizontal axis (blue line).
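For illustration, a minimal sketch of this alignment step is given below, assuming OpenCV and eye centers given as (x, y) pixel coordinates; rotating about the mid-point between the eyes is an assumption, since the text only specifies that the eye line is made horizontal.

```python
import cv2
import numpy as np

def align_eyes(img, left_eye, right_eye):
    """Rotate the image so the line between the two eye centers becomes horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # angle of the eye line w.r.t. the horizontal axis
    center = ((left_eye[0] + right_eye[0]) / 2.0,   # assumed rotation center: mid-point between the eyes
              (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(img, rot, (img.shape[1], img.shape[0]))
```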

B. Synthetic Sample Generation

One of the main problems of CNN methods is that they usually need a lot of data in the training phase to achieve a good accuracy rate. To address this problem, Simard et al. [36] propose the generation of synthetic images (i.e. real images with artificial rotations) to increase the database. The authors show the benefits of applying combinations of translations, rotations and skewing to increase the database. Following this idea, in this paper we use a 2D Gaussian distribution (σ = 3°) to introduce random noise in the locations of the eye centers. Synthetic images are generated by applying the spatial normalization to the noisy eye locations. For each image, 30 synthetic images are generated.

As can be seen in Fig. 4, a Gaussian distribution of points is generated for both eyes. With this distribution, the new locations are concentrated in regions near the original points and only a few locations are far from them. Based on the angle defined by each pair of synthetic points (one for the left eye and one for the right eye), the image is rotated around the middle point between the synthetic eye centers. Because the Gaussian distribution has a small standard deviation (σ = 3°), the majority of the synthetic samples are generated very close to the original one, resulting mostly in small rotation angles. Otherwise, too many images with large rotation angles (e.g. 90°, 180°, etc.) could affect the learning method and decrease the accuracy of the CNN.
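A minimal sketch of this augmentation is shown below. The text specifies the angular spread (σ = 3°) rather than the pixel-space noise, so the sigma_px value used to jitter the eye centers is an assumption chosen to yield mostly small rotation angles.

```python
import cv2
import numpy as np

def synthetic_rotations(img, left_eye, right_eye, n_samples=30, sigma_px=2.0, seed=0):
    """Generate synthetic samples by jittering the eye centers with 2D Gaussian
    noise and rotating the image by the angle of each noisy eye pair (cf. Fig. 4)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    mid = ((left_eye[0] + right_eye[0]) / 2.0, (left_eye[1] + right_eye[1]) / 2.0)
    samples = []
    for _ in range(n_samples):
        l = np.asarray(left_eye, dtype=float) + rng.normal(0.0, sigma_px, 2)
        r = np.asarray(right_eye, dtype=float) + rng.normal(0.0, sigma_px, 2)
        angle = np.degrees(np.arctan2(r[1] - l[1], r[0] - l[0]))   # small rotation angle
        rot = cv2.getRotationMatrix2D(mid, angle, 1.0)
        samples.append(cv2.warpAffine(img, rot, (w, h)))
    return samples
```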

C. Image Cropping

As shown in Fig. 2, the original image has a lot of background information that is not important to the expression classification procedure.

Fig. 4. Illustration of the synthetic sample generation. The Gaussian synthetic sample generation procedure increases the database size and variation, adding rotation noise θ to the images in a controlled way. The rotation angles are small and are generated by a pair of points coming from two different Gaussian distributions with σ = 3°, one for each eye. The green crosses are the original eye points, while the red ones are the synthetic points.

This information could decrease the accuracy of the classification because the classifier would have one more problem to solve: discriminating between background and foreground. After the cropping, all image parts that do not carry expression-specific information are removed. The cropping region also tries to remove facial parts that do not contribute to the expression (e.g. ears, part of the forehead, etc.). The region of interest is defined based on a ratio of the inter-eye distance; consequently, our method is able to handle different people and image sizes without human intervention. To get the region shown in Fig. 5, a multiplying factor of 4.5 times the inter-eye distance was used in the vertical axis and a factor of 2.4 was used in the horizontal axis. These values were chosen empirically.

Fig. 5. Image cropping example. The spatially normalized input image (left) is cropped (right) to remove all non-expression features, such as background and hair.
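A minimal sketch of the cropping, together with the 32x32 downsampling described in Section III-D, is given below. The text gives only the two multiplying factors, so centering the crop box on the mid-point between the eyes is an assumption.

```python
import cv2
import numpy as np

def crop_and_downsample(img, left_eye, right_eye,
                        v_factor=4.5, h_factor=2.4, out_size=(32, 32)):
    """Crop a region proportional to the inter-eye distance and resize it to 32x32."""
    d = np.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])  # inter-eye distance
    cx = (left_eye[0] + right_eye[0]) / 2.0   # assumed box center: eye mid-point
    cy = (left_eye[1] + right_eye[1]) / 2.0
    half_w, half_h = h_factor * d / 2.0, v_factor * d / 2.0
    x0, x1 = int(max(cx - half_w, 0)), int(min(cx + half_w, img.shape[1]))
    y0, y1 = int(max(cy - half_h, 0)), int(min(cy + half_h, img.shape[0]))
    roi = img[y0:y1, x0:x1]
    return cv2.resize(roi, out_size, interpolation=cv2.INTER_AREA)
```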


D. Downsampling

The downsampling operation is performed to ensure the same location for the face components (eyes, mouth, eyebrows, etc.) in every image. This procedure helps the CNN learn which regions are related to each specific expression. The downsampling also enables the convolutions to be performed on the GPU, since most graphics cards nowadays have limited memory. The final image is 32x32 pixels.

E. Intensity Normalization

The image brightness and contrast can vary even between images of the same person with the same expression, increasing the variation in the feature vector. Such variations increase the complexity of the problem that the classifier has to solve for each expression. To reduce these issues, an intensity normalization is applied, using a method adapted from a bio-inspired technique described in [37]. Basically, the normalization is a two-step procedure: first, a subtractive local contrast normalization is performed; second, a divisive local contrast normalization is applied. In the first step, a Gaussian-weighted average of the neighbors of each pixel is subtracted from that pixel's value. In the second step, every pixel is divided by the standard deviation of its neighborhood. Both steps use a neighborhood kernel of 7x7 pixels (chosen empirically). An example of this procedure is illustrated in Fig. 6.

Fig. 6. Illustration of the intensity normalization. The figure shows the imagewith the original intensity (left) and its intensity normalized version (right).

Equation 1 shows how each new pixel value is calculated in the intensity normalization procedure:

x' = (x − µ_nhg(x)) / σ_nhg(x)    (1)

where x' is the new pixel value, x is the original pixel value, µ_nhg(x) is the Gaussian-weighted average of the neighbors of x, and σ_nhg(x) is the standard deviation of the neighbors of x.
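A minimal sketch of this two-step normalization, assuming OpenCV and a 7x7 Gaussian neighborhood; the default Gaussian sigma and the small epsilon guarding against division by zero are assumptions.

```python
import cv2
import numpy as np

def intensity_normalize(img, ksize=7, eps=1e-8):
    """Subtractive then divisive local contrast normalization (Eq. 1)."""
    x = img.astype(np.float32)
    mu = cv2.GaussianBlur(x, (ksize, ksize), 0)            # Gaussian-weighted local mean
    centered = x - mu                                      # subtractive step
    var = cv2.GaussianBlur(centered * centered, (ksize, ksize), 0)
    sigma = np.sqrt(np.maximum(var, 0.0))                  # local standard deviation
    return centered / (sigma + eps)                        # divisive step
```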

F. Convolutional Network

The architecture of our Convolutional Network is shown in Fig. 7. The network receives as input a 32x32 grayscale image and outputs the confidence of each expression; the class with the maximum value is taken as the expression in the image. Our CNN architecture comprises 2 convolutional layers and 2 subsampling layers. The first layer of the CNN is a convolutional layer that applies a 7x7 convolution kernel and outputs an image of 28x28 pixels. This layer is followed by a subsampling layer that uses max-pooling (with kernel size 2x2) to reduce the image to half of its size. Subsequently, a new convolution is applied to the feature maps and is followed by another subsampling. The output is given to a fully connected layer with 256 neurons. The network has six output nodes (one for each expression, outputting its confidence level) that are fully connected to the previous layer.

Fig. 7. Architecture of the proposed Convolutional Neural Network. It comprises five layers: the first layer (convolutional) outputs 32 maps; the second layer (subsampling) reduces the map size by half; the third layer (convolutional) outputs 64 maps for each input; the fourth layer (subsampling) reduces the map size once more by half; the fifth layer (fully connected) and the final output with 6 nodes, one per expression, are responsible for classifying the facial image.
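For reference, a minimal PyTorch sketch of an architecture following this description is given below. The activation functions, the kernel size of the second convolution and the use of unpadded (valid) convolutions are assumptions, since they are not fully specified in the text; with valid convolutions the intermediate map sizes differ slightly from the 28x28 quoted above.

```python
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """32x32 grayscale input -> conv (32 maps) -> max-pool 2x2 ->
    conv (64 maps) -> max-pool 2x2 -> fully connected (256) -> 6 outputs."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7),   # 32x32 -> 26x26 (valid convolution)
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 26x26 -> 13x13
            nn.Conv2d(32, 64, kernel_size=7),  # 13x13 -> 7x7
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 7x7 -> 3x3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),         # one confidence per expression
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```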

IV. RESULTS AND DISCUSSION

The experiments were performed using the Extended Cohn-Kanade (CK+) database [29] and the training and testing methodology described in [4]. Accuracy is computed considering one classifier to classify all learned expressions. In addition, in order to perform a fair comparison with the method proposed in [4], accuracy is also computed considering one binary classifier for each expression.

The implementation of the normalization steps was done in-house using C++ and OpenCV, and we used a GPU-based CNN framework developed in C++ together with a Matlab wrapper as the classifier (developed by Demyanov et al. [38]). All the experiments were carried out using an Intel Core i7 3.4 GHz with an NVIDIA GeForce GTX 660 CUDA-capable card with 1.5 GB of GPU memory. The normalization and classification steps took on average 0.11 and 0.23 seconds, respectively. The convolutional network parameters are presented in Table I.

In this section, a study is carried out showing the impact of each normalization step on the accuracy of the method. Firstly, we describe the database used for the experiments. Secondly, each experiment is presented and discussed in detail. Thirdly, a comparison with state-of-the-art methods is presented. Finally, the limitations are discussed.


TABLE I
CNN PARAMETERS

Parameter       Value
Epochs          300
Loss Function   Logistic Regression
Momentum        0.95
Learning Rate   0.05
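A minimal sketch of a training configuration mirroring Table I is given below (continuing the ExpressionCNN sketch above). Treating the logistic regression loss as multinomial cross-entropy, using plain SGD, and the batch size are assumptions; the random tensors only stand in for the pre-processed CK+ images.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: the pre-processed 32x32 faces and labels 0..5 would go here.
images = torch.randn(64, 1, 32, 32)
labels = torch.randint(0, 6, (64,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

model = ExpressionCNN()
criterion = nn.CrossEntropyLoss()                        # multinomial logistic regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.95)

for epoch in range(300):                                 # Epochs = 300 (Table I)
    for batch_images, batch_labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
```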

A. Test Database

The presented system was trained and tested using the Extended Cohn-Kanade dataset (CK+) [29]. This dataset comprises 100 university students aged between 18 and 30 years. The subjects are 65% female, 15% African-American and 3% Asian or South American. The images were captured by a camera located directly in front of the subject. The students were instructed to perform a series of expressions; each sequence begins with the neutral expression and ends with the peak expression. All images in the dataset are 640 by 480 pixel arrays with 8-bit precision for grayscale values.

For a fair comparison with the state-of-the-art methods [4], [5], the neutral and contempt expression images were not used in our experiments. In addition, only the last 3 frames of each expression sequence were selected to compose the training/testing dataset. The dataset was divided into 8 groups without subject overlap between groups. This methodology was used in [4] to ensure that the testing groups do not contain subjects from the training groups.

B. Experiments

a) No Preprocessing: This first experiment was carried out using the original database, without any intervention or image pre-processing. The training and the testing for this, and all subsequent experiments, use an 8-fold procedure: the training was performed 8 times, each time leaving one group out for testing. The training of the proposed CNN uses a gradient descent method. Gradient descent methods depend on the order in which samples are presented while searching for a local minimum. To account for this variation, each experiment configuration is run 10 times (training and testing) with different image presentation orders. With different orders, the gradient descent can take different paths and increase (or decrease) the recognition rate. We report the accuracy both as an average and as the best value over all 10 runs. By reporting the average accuracy, we show that even if the presentation order is not the best, the method still achieves high accuracy rates. In this experiment, the best accuracy of all 10 runs was 61.70%, and the average was 56.93%. The average accuracy per expression is shown in Table II; this average is calculated as the number of hits across all runs divided by the total number of images across all runs.
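A minimal sketch of this evaluation protocol is shown below, assuming scikit-learn's GroupKFold for the subject-wise 8-fold split; train_and_test is a placeholder for training the CNN on the shuffled training images and returning the test accuracy.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def cross_validate(X, y, subject_ids, train_and_test, n_runs=10, seed=0):
    """8-fold subject-independent evaluation, repeated with 10 presentation orders."""
    rng = np.random.default_rng(seed)
    run_accuracies = []
    for _ in range(n_runs):
        fold_accuracies = []
        for train_idx, test_idx in GroupKFold(n_splits=8).split(X, y, subject_ids):
            order = rng.permutation(train_idx)            # shuffle the presentation order
            fold_accuracies.append(
                train_and_test(X[order], y[order], X[test_idx], y[test_idx]))
        run_accuracies.append(np.mean(fold_accuracies))
    return np.mean(run_accuracies), np.max(run_accuracies)   # average and best accuracy
```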

As can be seen in Table II, using only the CNN without any image pre-processing, the recognition rate is very low compared to state-of-the-art methods.

b) Spatial Normalization: As explained in Section III, a spatial normalization is applied, in both training and testing steps, to bring the features into the same pixel space. This normalization took only 0.09 seconds per image. The experiment follows the same methodology described before. The best accuracy of all 10 runs was 90.00%, and the average was 86.44%. The average accuracy per expression is shown in Table II. Compared with the previous result, we can note a significant increase in the recognition rate from adding the spatial normalization and cropping processes.

c) Intensity Normalization: The intensity normalization is used to remove brightness variation from the images in both training and testing. This experiment was performed using just the intensity normalization and cropping, following the same methodology described before. This normalization took only 0.02 seconds per image. The best accuracy of all 10 runs was 86.10%, and the average was 82.39%. The average accuracy per expression is shown in Table II.

d) Spatial Normalization and Intensity Normalization: Putting both normalization steps together, spatial and intensity, we remove a large part of the variation unrelated to the facial expression, leaving mostly the expression-specific variation that is not related to the person. This experiment follows the same methodology described before. The normalizations together take 0.11 seconds per image. The best accuracy of all 10 runs was 87.81%, and the average was 84.68%. The average accuracy per expression is shown in Table II.

As can be seen, the accuracy when applying both normalization procedures is lower than when using only the spatial normalization. The disgust and happy expressions have very low accuracy, which reduces the overall recognition average. To verify the need for the intensity normalization, a new experiment using only the spatial normalization procedure and the synthetic sample generation was performed and is presented below.

e) Spatial Normalization and Synthetic Samples: Comparing the result of the combined spatial and intensity normalization with that of the spatial normalization alone gives a false impression that the intensity normalization might decrease the accuracy of the method, since the result of applying only the spatial normalization is better than the result with both normalizations. To verify this suspicion, a new experiment was conducted, using only the spatial normalization procedure and the synthetic samples. This experiment follows the same methodology described before. The best accuracy of all 10 runs was 90.83%, and the average was 87.54%. The average accuracy per expression is shown in Table II.

f) Spatial Normalization, Intensity Normalization and Synthetic Samples: The best result achieved with our method applies the three image pre-processing steps: spatial normalization, intensity normalization and synthetic sample generation. For the synthetic sample generation, thirty additional samples were generated for each image. This experiment follows the same methodology described before. The total training time for this experiment was four and a half days. The best accuracy of all 10 runs was 93.74%, and the average was 91.46%. The average accuracy per expression is shown in Table II. The accuracy of this experiment shows that joining the three techniques (spatial normalization, intensity normalization and synthetic samples) is better than using only the spatial normalization and the synthetic samples, as presented previously.


Table II shows the mean accuracy for each expression using all the preprocessing configurations already discussed: in a) no preprocessing is used, in b) just the spatial normalization is employed, in c) just the intensity normalization is used, in d) both normalizations (spatial and intensity) are used, in e) the spatial normalization together with the synthetic samples is used and in f) both normalizations and the synthetic samples are used.

TABLE II
PREPROCESSING STEPS ACCURACY DETAILS

     Angry    Disgust  Fear     Happy    Sad      Surprise
a)   31.40%   63.67%   11.60%   68.30%   24.64%   81.08%
b)   79.70%   87.40%   68.00%   94.00%   64.52%   96.06%
c)   70.14%   90.50%   43.00%   94.78%   53.21%   94.65%
d)   90.96%   44.40%   95.16%   57.38%   94.73%   84.18%
e)   80.22%   88.98%   71.86%   95.74%   69.04%   94.61%
f)   82.22%   96.55%   74.93%   98.06%   76.57%   97.38%

C. Comparisons

Table III summarizes all results presented in this section, showing the evolution of the proposed method as the image pre-processing steps are changed.

TABLE III
PREPROCESSING COMPARISON

Preprocessing                                   Average    Best
None                                            56.93±5%   61.70%
Spatial Normalization                           86.44±3%   90.00%
Intensity Normalization                         82.39±1%   86.10%
Both Normalizations                             84.68±2%   87.81%
Spatial Normalization and Synthetic Samples     87.54±1%   90.83%
Both Normalizations and Synthetic Samples       91.46±1%   93.74%

As can be seen in Table III, the best performance was achieved using both normalization procedures and the synthetic samples. Using the training weights that obtained the best accuracy, the confusion matrix shown in Table IV was built. The recognition of the disgust, happy and surprise expressions achieves an accuracy rate higher than 97%, while the angry and fear expressions are at about 85%. The sad expression achieves the lowest recognition rate, with only 79.76%. The fear expression was most often confused with the sad expression, which shows that the features of these two expressions are not well separated in pixel space, i.e. they are very similar to each other in some cases.

TABLE IV
CONFUSION MATRIX USING BOTH NORMALIZATIONS AND SYNTHETIC SAMPLES

          Angry    Disgust  Fear     Happy    Sad      Surprise
Angry     85.19%   0.56%    0%       0.48%    3.57%    0%
Disgust   3.70%    97.74%   0%       0%       0%       0%
Fear      5.19%    0%       85.33%   0%       9.25%    0%
Happy     0%       0%       5.33%    98.55%   0%       1.20%
Sad       3.70%    0%       4.00%    0.97%    79.76%   0%
Surprise  2.22%    1.69%    5.33%    0%       7.14%    98.80%

As discussed before, we also adapted our method to perform binary classifications for each expression, i.e. a one-versus-all classification as in [4]. In this setting, every image is presented to six binary classifiers, each one responsible for recognizing the presence, or not, of one specific expression. For example, if we present an image of a subject smiling, only the classifier of the happy expression should answer "yes", and all other classifiers should answer "no". Using this methodology, our accuracy was 97.81%. The comparison of this accuracy with other methods is shown in Table V.

TABLE V
PERFORMANCE COMPARISON

Method           Performance
CSPL [12]        89.90%
AdaGabor [39]    93.30%
3D-CNN [7]       95.00%
LBPSVM [5]       95.10%
BDBN [4]         96.70%
Our Method       97.81%
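For concreteness, a minimal sketch of the one-versus-all decision rule used in this comparison is given below; binary_classifiers is a placeholder list of six trained networks, each returning the confidence that its own expression is present.

```python
import numpy as np

EXPRESSIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise"]

def one_vs_all_predict(image, binary_classifiers):
    """Assign the image to the expression whose binary classifier is most confident."""
    scores = [clf(image) for clf in binary_classifiers]  # one confidence per expression
    return EXPRESSIONS[int(np.argmax(scores))]
```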

D. Limitations

As discussed before, the presented method needs the location of each eye for the image pre-processing steps. Eye detection could easily be added to the system by adopting the method shown in [40]. In addition, as shown in Table II, the accuracy of some expressions, like fear and sad, was less than 80%, while the accuracy of the whole method was about 93%. This suggests that the variation between these classes is not enough to separate them. One approach to address this problem is to create a specialized classifier for those expressions, to be used as a second stage.

V. CONCLUSION

In this paper, we propose a facial expression recognition system that uses a combination of standard methods, like Convolutional Networks and specific image pre-processing steps. Experiments showed that the combination of the normalization procedures significantly improves the method's accuracy. As shown in the results, in comparison with the state-of-the-art methods that use the same facial expression database, our method achieves a better accuracy and presents a simpler solution. In addition, it takes less time to train. As future work, we want to test this approach on other databases and perform a cross-database validation.

ACKNOWLEDGMENT

We would like to thank Universidade Federal do Espírito Santo - UFES (project SIEEPEF, 5911/2015), Fundação de Amparo à Pesquisa do Espírito Santo - FAPES (grants 65883632/14, 53631242/11, and 60902841/13) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES (grant 11012/13-7, and scholarship).

REFERENCES

[1] Y. Wu, H. Liu, and H. Zha, "Modeling facial expression space for recognition," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005), 2005, pp. 1968–1973.

[2] C. Darwin, The Expression of the Emotions in Man and Animals. CreateSpace Independent Publishing Platform, 2012.


[3] S. Z. Li and A. K. Jain, Handbook of Face Recognition. Springer Science & Business Media, 2011.

[4] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1805–1812.

[5] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.

[6] W. Liu, C. Song, and Y. Wang, "Facial expression recognition based on discriminative dictionary learning," in 2012 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 1839–1842.

[7] Y.-H. Byeon and K.-C. Kwak, "Facial expression recognition using 3d convolutional neural network," International Journal of Advanced Computer Science and Applications, vol. 5, no. 12, 2014.

[8] J.-J. J. Lien, T. Kanade, J. Cohn, and C. Li, "Detection, tracking, and classification of action units in facial expression," Journal of Robotics and Autonomous Systems, 1999.

[9] C.-R. Chen, W.-S. Wong, and C.-T. Chiu, "A 0.64 mm real-time cascade face detection design based on reduced two-field extraction," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 11, pp. 1937–1948, 2011.

[10] C. Garcia and M. Delakis, "Convolutional face finder: a neural architecture for fast and robust face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408–1423, 2004.

[11] Z. Zhang, D. Yi, Z. Lei, and S. Li, "Regularized transfer boosting for face detection across spectrum," IEEE Signal Processing Letters, vol. 19, no. 3, pp. 131–134, 2012.

[12] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: machine learning and application to spontaneous behavior," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, 2005, pp. 568–573.

[13] P. Liu, M. Reale, and L. Yin, "3d head pose estimation based on scene flow and generic head model," in 2012 IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 794–799.

[14] W. W. Kim, S. Park, J. Hwang, and S. Lee, "Automatic head pose estimation from a single camera using projective geometry," in 2011 8th International Conference on Information, Communications and Signal Processing (ICICS), 2011, pp. 1–5.

[15] M. Demirkus, D. Precup, J. Clark, and T. Arbel, "Multi-layer temporal graphical model for head pose estimation in real-world videos," in 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 3392–3396.

[16] Z. Zhang, M. Lyons, M. Schuster, and S. Akamatsu, "Comparison between geometry-based and gabor-wavelets-based facial expression recognition using multi-layer perceptron," in Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 454–459.

[17] P. Yang, Q. Liu, and D. Metaxas, "Boosting coded dynamic features for facial action units and facial expression recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), 2007, pp. 1–6.

[18] S. Jain, C. Hu, and J. Aggarwal, "Facial expression recognition with temporal modeling of shapes," in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 1642–1649.

[19] Y. Lin, M. Song, D. T. P. Quynh, Y. He, and C. Chen, "Sparse coding for flexible, robust 3d facial-expression synthesis," IEEE Computer Graphics and Applications, vol. 32, no. 2, pp. 76–88, 2012.

[20] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza, "Disentangling factors of variation for facial expression recognition," in Computer Vision - ECCV 2012, ser. Lecture Notes in Computer Science, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds. Springer Berlin Heidelberg, 2012, no. 7577, pp. 808–822.

[21] S. Lawrence, C. Giles, A. C. Tsoi, and A. Back, "Face recognition: a convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.

[22] G. Lv, "Recognition of multi-fontstyle characters based on convolutional neural network," in 2011 Fourth International Symposium on Computational Intelligence and Design (ISCID), vol. 2, 2011, pp. 223–225.

[23] X.-X. Niu and C. Y. Suen, "A novel hybrid CNN-SVM classifier for recognizing handwritten digits," Pattern Recognition, vol. 45, no. 4, pp. 1318–1325, 2012.

[24] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, 2004, pp. II-97–104.

[25] B. Fasel, "Robust face analysis using convolutional neural networks," in 16th International Conference on Pattern Recognition, 2002, vol. 2, pp. 40–43.

[26] B. Fasel, "Head-pose invariant facial expression recognition using convolutional neural networks," in Fourth IEEE International Conference on Multimodal Interfaces, 2002, pp. 529–534.

[27] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, "Subject independent facial expression recognition with robust face detection using a convolutional neural network," Neural Networks, vol. 16, no. 5, pp. 555–559, 2003.

[28] P. Zhao-yi, W. Zhi-qiang, and Z. Yu, "Application of mean shift algorithm in real-time facial expression recognition," in International Symposium on Computer Network and Multimedia Technology (CNMT 2009), 2009, pp. 1–4.

[29] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 94–101.

[30] C.-D. Caleanu, "Face expression recognition: A brief overview of the last decade," in 2013 IEEE 8th International Symposium on Applied Computational Intelligence and Informatics (SACI), 2013, pp. 157–161.

[31] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with gabor wavelets," in Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 200–205.

[32] M. Dahmane and J. Meunier, "Emotion recognition using dynamic grid-based HoG features," in 2011 IEEE International Conference on Automatic Face Gesture Recognition and Workshops (FG 2011), 2011, pp. 884–888.

[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[34] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," The Journal of Physiology, vol. 195, no. 1, pp. 215–243, 1968.

[35] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI'11). AAAI Press, 2011, pp. 1237–1242.

[36] P. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Seventh International Conference on Document Analysis and Recognition, 2003, pp. 958–963.

[37] B. A. Wandell, Foundations of Vision, 1st ed. Sunderland, Mass: Sinauer Associates Inc, 1995.

[38] S. Demyanov, J. Bailey, R. Kotagiri, and C. Leckie, "Invariant backpropagation: how to train a transformation-invariant neural network," arXiv:1502.04434 [cs, stat], 2015.

[39] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. Metaxas, "Learning active facial patches for expression analysis," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2562–2569.

[40] J. M. Saragih, S. Lucey, and J. F. Cohn, "Deformable model fitting by regularized landmark mean-shift," International Journal of Computer Vision, vol. 91, no. 2, pp. 200–215, 2010.