
Facial Expression Recognition based on CNN

Qian Liu, Jiayang Wang
{liuqian14, jy-wang14}@mails.tsinghua.edu.cn

The Department of Electronic Engineering, Tsinghua University

Abstract

Facial expression recognition has been an active research area in recent years, and many kinds of methods have been proposed. In this project, we mainly used two mainstream Convolutional Neural Networks, AlexNet and GoogLeNet, to recognize human facial expressions and emotions. CNNs are capable of extracting powerful information about a facial image by using multiple layers of feature detectors. On top of the two CNNs, some novel and helpful methods are applied in data preprocessing and model optimization. The recognition results show that the CNNs used in this project achieve good accuracy.

I. INTRODUCTION

Facial expression is one of the most helpful features in human emotion recognition. It was first introduced by Darwin in [1]. In [2], facial expression was defined as the facial changes in response to a person's emotional state.

Facial expression recognition is a task that we humans all perform in our daily lives, but it cannot be easily performed by computers. With the fast pace of innovation in computer vision, as well as the growing popularity of machine learning and deep learning, facial expression recognition has shown great potential and has been an active topic for nearly ten years. Many researchers have devoted themselves to this area, and quite a few methods have been proposed. Nowadays, facial expression recognition has a variety of applications, such as interactive games and sociable robots. The area is still full of potential and vitality.

The project report is organized as follows. In Section II, some related work is introduced, mainly covering the different mainstream methods in use today. Section III gives a brief introduction to Convolutional Neural Networks and how they can be applied to facial expression recognition. Sections IV and V take an open dataset as an example for training the recognition model: Section IV focuses on the dataset and data preprocessing, and Section V describes how we adopt CNNs to train the model and the methods we use to improve accuracy on the test set. In Section VI, we draw conclusions and introduce our future work.

II. RELATED WORKS

Facial expression recognition has developed its own methodology. As described in [2], facial expression analysis consists of three main parts: face acquisition, feature extraction and representation, and expression recognition. Related work for each part is presented below.

i. Face acquisition

Face acquisition is mainly split into two parts: face detection and head pose estimation. Face detection is another field that used to be popular, and its methods have become very mature with low error rates [3]. As for head pose estimation, relevant transformations can be used to adjust the location and angle of the face []. Therefore, face acquisition is not hard to do.

ii. Feature extraction

Feature extraction mainly refers to the extraction of facial changes caused by facial expressions. These changes can be extracted using methods based on geometric features or on changes of appearance. Geometric feature-based methods mainly utilize the shape and location of facial components such as the eyes, eyebrows and mouth [4]. Appearance-based methods, in contrast, work with features extracted from the whole face or from some regions, by applying image filters to the whole face image [5].
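Purely as an illustration of such an appearance-based descriptor (this project itself relies on features learned by CNNs), the sketch below computes grid-wise Local Binary Pattern histograms of the kind studied in [7], using scikit-image; the grid size and LBP parameters are arbitrary choices.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(face, points=8, radius=1, grid=(6, 6)):
    """Concatenate uniform-LBP histograms over a grid of face regions."""
    lbp = local_binary_pattern(face, points, radius, method="uniform")
    bins = points + 2                       # number of uniform patterns
    h, w = face.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=bins, range=(0, bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)            # fixed-length appearance descriptor

# Example: a dummy 48x48 "face" yields a 6*6*10 = 360-dimensional feature vector.
print(lbp_histogram(np.random.rand(48, 48)).shape)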

iii. Expression Recognition

Once feature extraction is finished, expression recognition can be performed. In [6], expression recognition is described as a three-step procedure: feature learning, feature selection and classifier construction. There is a large body of related work on each step. Feature learning is often combined with feature extraction, which prepares all the features related to the expression. Feature selection is tougher: since expression recognition is often regarded as a classification problem, the selected features should minimize the intra-class variation while maximizing the inter-class variation [7]. Finally, a classifier or a set of classifiers is used to recognize the facial expression based on the selected features.

What's more, thanks to the emergence of deep learning, and especially of Convolutional Neural Networks [8], several facial expression recognition approaches have been developed in the past decades with steadily increasing recognition performance. There is no doubt that the most powerful and popular method has become the Convolutional Neural Network, which is introduced in the next section.

III. CONVOLUTIONAL NEURAL NETWORK

The Convolutional Neural Network (CNN) was first proposed by LeCun et al. [9]. It has been shown to be very effective in learning features when deeper architectures and new training techniques are used. Generally, a CNN includes convolutional layers, sub-sampling layers and fully connected layers.

Convolutional layers are usually characterized by their kernel size. Sub-sampling layers are used to increase the position invariance of the kernels; the main types are max pooling and average pooling [10]. Fully connected layers are similar to those in general neural networks: their neurons are fully connected with the previous layer. The learning procedure of a CNN consists of finding the best synaptic weights W. Supervised learning can be performed using a gradient descent method.
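To make these layer types concrete, the following minimal sketch defines a small CNN with convolutional, max-pooling and fully connected layers and runs one gradient descent step. It is written in PyTorch purely for illustration (the project itself defines its networks in Caffe), and all channel counts and kernel sizes are arbitrary.

import torch
import torch.nn as nn

class TinyExpressionCNN(nn.Module):
    """Illustrative CNN: convolution -> sub-sampling (pooling) -> fully connected."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # sub-sampling (max pooling)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 12 * 12, num_classes)  # fully connected layer

    def forward(self, x):                                 # x: (batch, 1, 48, 48)
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Supervised learning: one gradient descent step on the cross-entropy loss.
model = TinyExpressionCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
logits = model(torch.randn(8, 1, 48, 48))
loss = criterion(logits, torch.randint(0, 7, (8,)))
loss.backward()
optimizer.step()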

The CNN is the most popular technique that has been successfully applied to the facial expression recognition problem. It also combines the three steps of facial expression recognition stated in Section II (feature learning, feature selection and classification) into a single step. AlexNet and GoogLeNet are two of the most popular CNNs in image classification. In this project, we use these two typical CNNs to perform facial expression recognition.

IV. DATASET & PREPROCESSING

The dataset used in this project is obtained from Kaggle, from a competition called Challenges in Representation Learning: Facial Expression Recognition Challenge.

The data consists of 48x48 pixel grayscale images of faces belonging to 7 different categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). Some example images are shown in figure 1.

Figure 1: Samples in the Dataset

The dataset is split into a training set and a test set, which contain 28,709 and 3,589 face images respectively. The distribution of the different categories in the training set is shown in table 1.


angry  disgust  fear  happy  sad  surprise  neutral
 3995      436  4097   7215  4830     3171     4965

Table 1: The distribution of the 7 categories
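The Kaggle data is distributed as a single CSV file in which each row holds an integer emotion label and a space-separated string of 2,304 pixel values. The loader below is a minimal sketch under that assumption; the file name fer2013.csv and the column names emotion, pixels and Usage are the usual ones for this competition but should be checked against the actual download.

import numpy as np
import pandas as pd

def load_fer_csv(path="fer2013.csv"):
    """Parse the Kaggle CSV into 48x48 image arrays and integer labels."""
    df = pd.read_csv(path)
    images = np.stack([np.asarray(p.split(), dtype=np.uint8).reshape(48, 48)
                       for p in df["pixels"]])
    labels = df["emotion"].to_numpy()
    is_train = (df["Usage"] == "Training").to_numpy()
    return (images[is_train], labels[is_train]), (images[~is_train], labels[~is_train])

(train_x, train_y), (test_x, test_y) = load_fer_csv()
print(train_x.shape, test_x.shape)   # training rows vs. the held-out test rows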

Briefly, the task is to categorize each face into one of the seven categories based on the emotion shown in the facial expression. But before using these images, we should figure out whether any preprocessing work needs to be done. Data preprocessing is always time-consuming, but it is essential because it has a direct influence on the final performance. Luckily, the faces have already been well processed: they are more or less centered and occupy about the same amount of space in each image, and they are all normalized, so no further work is needed.

V. RECOGNITION BASED ON CNN

In this section, we use AlexNet and GoogLeNet respectively to perform facial expression recognition, and obtain optimized results by adopting several methods, including oversampling, an SVM and so on. All training and testing are performed with the deep learning framework Caffe.
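For reference, training and fine-tuning with Caffe's Python interface looks roughly like the sketch below; the prototxt and weight file names are placeholders, and the actual solver settings used in this project are not given in the report.

import caffe

caffe.set_mode_gpu()                      # fall back to caffe.set_mode_cpu() if no GPU is available

# The solver prototxt points at the train/test network definition,
# e.g. an AlexNet or GoogLeNet prototxt adapted to 7 output classes.
solver = caffe.SGDSolver("solver.prototxt")          # placeholder file name

# Optional fine-tuning: start from publicly released ImageNet weights.
solver.net.copy_from("bvlc_alexnet.caffemodel")      # placeholder weight file

solver.solve()                            # run the optimization defined by the solver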

It is worth noting that the process was not all smooth sailing. At the beginning, we chose to use the CNN in [11], whose network structure is shown in figure 2. Unfortunately, the training process never converged. After trying for a long time, we gave up on this network and turned to AlexNet and GoogLeNet.

Figure 2: The first attempted CNN

i. AlexNet

The network structure of AlexNet is shown in figure 3. AlexNet requires input images of size 224x224. After the images were resized, we used the training set to train the network.

Figure 3: AlexNet
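Since AlexNet expects larger three-channel inputs, each 48x48 grayscale face has to be upscaled and replicated across channels. A minimal version of this preprocessing (using OpenCV, with the 224x224 size stated above, and omitting any mean subtraction) is sketched below.

import cv2
import numpy as np

def to_alexnet_input(gray48, size=224):
    """Upscale a 48x48 grayscale face and stack it into 3 identical channels."""
    big = cv2.resize(gray48, (size, size), interpolation=cv2.INTER_LINEAR)
    return np.repeat(big[:, :, None], 3, axis=2).astype(np.float32)

faces = np.random.randint(0, 256, (4, 48, 48), dtype=np.uint8)   # dummy stand-ins
batch = np.stack([to_alexnet_input(f) for f in faces])
print(batch.shape)   # (4, 224, 224, 3); Caffe itself expects channel-first (NCHW) blobs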

A) Baseline

The results after training for the first time are shown below (figure 4); they were not good enough.

Figure 4: Before fine-tuning

After fine-tuning, the results were not improved (figure 5). However, neither run shows signs of overfitting.

Figure 5: After fine-tuning

We use a confusion matrix to summarize the results on the test set, shown in the table below (figure 6). The total accuracy is 48.6%.

Figure 6: Results on test set
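For completeness, per-class recall and total accuracy can be derived from a confusion matrix as in the sketch below, which uses scikit-learn and dummy labels; it only illustrates the computation behind numbers such as the 48.6% above, not the actual evaluation script.

import numpy as np
from sklearn.metrics import confusion_matrix

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def report(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(EMOTIONS))))
    recall = cm.diagonal() / cm.sum(axis=1).clip(min=1)    # per-class recognition rate
    accuracy = cm.diagonal().sum() / cm.sum()              # total accuracy
    for name, r in zip(EMOTIONS, recall):
        print(f"{name:>8s}: recall {r:.1%}")
    print(f"total accuracy: {accuracy:.1%}")
    return cm

# Dummy labels only, to show the call pattern.
report(np.random.randint(0, 7, 200), np.random.randint(0, 7, 200))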


B) Oversampling

It is obvious that the recognition rate (recall) is extremely low for the second expression (disgust). It turns out that the training set is imbalanced. We thought an SVM might help in this situation, so we used the CNN's output probability vectors as feature vectors for an SVM, only to find it useless.
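For reference, the SVM attempt described above, in which the CNN's 7-dimensional softmax outputs are reused as features for a multi-class SVM, could look roughly like the sketch below with scikit-learn; the kernel and C value are assumptions, and as noted this did not improve our results.

import numpy as np
from sklearn.svm import SVC

# prob_train / prob_test stand in for the CNN softmax outputs, shape (n_samples, 7).
prob_train = np.random.dirichlet(np.ones(7), size=500)
prob_test = np.random.dirichlet(np.ones(7), size=100)
y_train = np.random.randint(0, 7, 500)
y_test = np.random.randint(0, 7, 100)

svm = SVC(kernel="rbf", C=1.0)       # hyperparameters are illustrative only
svm.fit(prob_train, y_train)         # probability vectors used as feature vectors
print("SVM accuracy on test probabilities:", svm.score(prob_test, y_test))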

To solve this problem, we refer to [5] and perform oversampling with the SMOTE algorithm. For each sample of the class with the fewest samples, we find its k nearest neighbors under a given criterion, e.g. the Euclidean metric. We then sample randomly from these k neighbors and generate a new sample by interpolating between the chosen sample and its chosen neighbor. By iteratively applying SMOTE to the original dataset, we obtain a balanced dataset.
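A bare-bones version of this SMOTE procedure, operating on flattened images with Euclidean nearest neighbours and uniform interpolation weights, is sketched below; an off-the-shelf implementation such as the one in imbalanced-learn could be used instead.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating each sample with one of its k neighbours."""
    rng = np.random.default_rng(seed)
    X = minority.reshape(len(minority), -1).astype(np.float32)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
    _, idx = nn.kneighbors(X)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))                      # pick a minority sample
        j = idx[i][rng.integers(1, k + 1)]            # pick one of its k nearest neighbours
        lam = rng.random()                            # interpolation coefficient in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.stack(synthetic).reshape((n_new,) + minority.shape[1:])

# Example: grow the 436 "disgust" faces towards a balanced class size.
disgust = np.random.rand(436, 48, 48)                 # dummy stand-in for the real images
augmented = np.concatenate([disgust, smote(disgust, 7000 - 436)])
print(augmented.shape)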

After this oversampling, each expression class has approximately 7,000 samples. The network is then trained again. Although the results on the training set remain nearly unchanged, the recall on the test set is clearly improved. Unfortunately, however, the problem of overfitting emerges.

Figure 7: After oversampling

Figure 8: Results on test set

C) Creating a New Label

To help prevent overfitting, we use another method: splitting the happy category into two groups to create a new label. After training, we combine happy and the new label back together. With this change, the accuracy rises to 50.0%, and the recognition rate (recall) of disgust changes from 20.0% to 47.27%. The results are well improved.

Figure 9: Results on test set
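The report does not fix a criterion for splitting the happy class, so the sketch below simply assigns half of the happy training samples to a hypothetical eighth label at random; the network is then trained with 8 outputs, and predictions of the extra label are folded back into happy before evaluation.

import numpy as np

HAPPY, NEW_LABEL = 3, 7          # label 7 is a hypothetical extra class used only during training
rng = np.random.default_rng(0)

def split_happy(labels):
    """Randomly relabel half of the happy samples as the new class (training labels only)."""
    labels = labels.copy()
    happy_idx = np.flatnonzero(labels == HAPPY)
    chosen = rng.choice(happy_idx, size=len(happy_idx) // 2, replace=False)
    labels[chosen] = NEW_LABEL
    return labels

def merge_predictions(pred):
    """Map predictions of the new class back to happy before computing accuracy."""
    return np.where(pred == NEW_LABEL, HAPPY, pred)

train_labels = split_happy(np.random.randint(0, 7, 1000))   # dummy labels for illustration
print(np.bincount(train_labels, minlength=8))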

ii. GoogLeNet

GoogLeNet introduces the Inception module, which enhances feature extraction. The structure is shown in figure 10.

Figure 10: GoogLeNet

We use the same method to cope with the imbalanced dataset and train the model with GoogLeNet. The results on the test set are very good: the recognition rate (recall) is clearly higher than with AlexNet, and the accuracy rises to 61.6%.

Figure 11: Results on test set

VI. CONCLUSION & FUTURE WORK

In this project, we mainly focused on training and tuning AlexNet and GoogLeNet, coping with the imbalanced dataset by oversampling and alleviating the overfitting problem by creating a new label. With these methods, the results are well improved compared to the baseline.

Although the trained models perform reasonably well, we think a lot of work still needs to be done. As future work, more methods of balancing the dataset need to be tried, and the network structure should be further optimized. Meanwhile, it remains to be seen whether there are more powerful methods for extracting the features.

REFERENCES

[1] Darwin, Charles. The Expression of the Emotions in Man and Animals. Cambridge University Press, 2009: 692-696.

[2] Anstey, K. J., S. M. Hofer, and M. A. Luszcz. Handbook of Face Recognition. Springer New York, 2005.

[3] Zhang, Zhiwei, et al. "Regularized Transfer Boosting for Face Detection Across Spectrum." IEEE Signal Processing Letters 19.3 (2012): 131-134.

[4] Zhang, Z., et al. "Comparison Between Geometry-Based and Gabor-Wavelets-Based Facial Expression Recognition Using Multi-Layer Perceptron." Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1998: 454-459.

[5] Jain, S., C. Hu, and J. K. Aggarwal. "Facial expression recognition with temporal modeling of shapes." IEEE International Conference on Computer Vision Workshops, 2011: 1642-1649.

[6] Liu, Ping, et al. "Facial Expression Recognition via a Boosted Deep Belief Network." IEEE Conference on Computer Vision and Pattern Recognition, 2014: 1805-1812.

[7] Shan, Caifeng, S. Gong, and P. W. McOwan. "Facial expression recognition based on Local Binary Patterns: A comprehensive study." Image & Vision Computing 27.6 (2009): 803-816.

[8] Byeon, Young Hyen, and K. C. Kwak. "Facial Expression Recognition Using 3D Convolutional Neural Network." International Journal of Advanced Computer Science & Applications 5.12 (2014).

[9] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[10] Ciresan, Dan C., et al. "Flexible, high performance convolutional neural networks for image classification." Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 2011: 1237-1242.

[11] Kim, Bo Kyeong, et al. "Hierarchical committee of deep convolutional neural networks for robust facial expression recognition." Journal on Multimodal User Interfaces 10.2 (2016): 173-189.
