arXiv:2102.08983v1 [cs.CV] 17 Feb 2021

Automated Detection of Equine Facial Action Units

Zhenghong Li 1,2    Sofia Broomé 1    Pia Haubro Andersen 3    Hedvig Kjellström 1,4

1 KTH Royal Institute of Technology, Sweden   2 Stony Brook University, USA
3 Swedish University of Agricultural Sciences, Sweden   4 Silo AI, Sweden

[email protected]   [email protected]   [email protected]   [email protected]

Abstract

The recently developed Equine Facial Action Coding System (EquiFACS) provides a precise and exhaustive, but laborious, manual labelling method for facial action units of the horse. To automate parts of this process, we propose a Deep Learning-based method to detect EquiFACS units automatically from images. We use a cascade framework; we first train several object detectors to detect predefined Regions of Interest (ROIs), and then apply binary classifiers for each action unit in the related regions. We experiment with both regular CNNs and a more tailored model transferred from human facial action unit recognition. Promising initial results are presented for nine action units in the eye and lower face regions. Code for the project is publicly available at https://github.com/ZhenghLi/Automated-Detection-of-Equine-Facial-Action-Units.

1. Introduction

The horse is a highly social species, and facial communication is of utmost importance for the function of the herd. Accordingly, the horse has a remarkable repertoire of facial expressions, which may be described by 17 degrees of freedom, so-called action units [19]. This repertoire is smaller than that of humans, who have 27 action units [5], but larger than, for example, the chimpanzee repertoire of 13 action units [3].

While the detailed analysis of the facial expressions of people to assess their emotions is a mature field [4], almost nothing is known about the association between facial activity and emotional states in animals. This is primarily due to the lack of self-report of emotions and other inner states in animals. Nevertheless, facial expressions are expected to convey important information about animal welfare [9], but methodologies for investigating this are lacking.

In the past few years, great progress has been made in the field of Computer Vision. With the adoption of Deep Learning models such as Convolutional Neural Networks (CNNs), the accuracy of computer models on some tasks, such as image classification, is now competitive with human performance. Human facial action unit detection has likewise advanced in recent years.

Therefore, in this work, we investigate the possibility of automatically recognizing horse facial action units. We currently focus on how to do this from still images. Even though the facial configurations of horses and humans are very different, a remarkably high number of action units are conserved across species [20]. We therefore transfer methods for human action unit detection to horses.

There are two main contributions of our project:

• We propose a cascade framework for the recognition of horse facial action units.

• We apply standard models for general image classification as well as for human facial action unit recognition to horses within our framework and compare their performance across multiple experimental settings.

2. Related Work

Facial expressions can be described as combinations of different facial action units. A facial action unit is based on the visible movement of a facial muscle lying under the skin [17]. In 1978, Ekman and Friesen proposed the Facial Action Coding System (FACS) [5]. By electrically stimulating individual muscles and learning to control them voluntarily, they associated each action unit with one or more facial muscles [2]. The recording of facial actions is entirely atheoretical; any inference about their meaning takes place during later analysis. In 2002, Ekman et al. [6] proposed the final version of human FACS, which has since been widely used for research in human emotion recognition.

Inspired by the progress of human FACS, Wathan et al. [19] created EquiFACS. As in FACS, EquiFACS consists of action units (AUs) and action descriptors (ADs).

In addition, the movements of the ears of horses are specifically named ear action descriptors (EADs). Until recently, EquiFACS had not been used for research on animal emotions, due to the very time-consuming manual labelling. An initial study of facial expressions of pain in horses [13] showed that pain indeed is associated with increased frequencies of certain facial AUs. These AUs were anatomically located in the ear region, around the eyes, and at the nostrils and muzzle. A major limitation of that study was the small sample size, which was attributed to the extremely resource-demanding, but necessary, hand labelling of the videos. A prerequisite for more research on animal emotions using AUs is therefore the development of methods that allow automated AU detection.

Two approaches to pain recognition in animals via predefined facial AUs have been explored, one for sheep [12] and one for horses and donkeys [8]. Compared to our method, they both rely on more coarse-grained underlying facial expression representations. The simpler structure increases robustness but limits the range and precision of expressions that can be represented. A third approach is to learn the underlying representation of pain expressions in horses from raw data in an end-to-end manner [1], without imposing any designed coding system. In [1], the authors used a recurrent neural network structure that exploited both temporal and spatial information from video of horses, and found that temporal information (video) is important for pain assessment. A future research direction is to study the interplay between data-driven learned emotion expression representations and EquiFACS.

Ever since Krizhevsky et al. proposed AlexNet [10], deep CNNs have been replacing traditional methods in image classification owing to their outstanding performance. After AlexNet, deeper models such as VGG [18] and ResNet [7] have been proposed and applied as feature extractors in various fields. In our work, we chose CNNs as the AU classifiers.

CNNs are also widely used for object detection. In this work, an object detector network is employed to detect predefined regions of interest (ROIs). Object detectors can be divided into two categories: one-stage methods and two-stage methods. One-stage methods such as YOLOv3 [14] and SSD [11] perform detection over predefined anchors in a single stage and can be trained in an end-to-end manner. Two-stage methods such as Faster R-CNN [15] first generate anchor box proposals via a region proposal network and then use ROI pooling to crop out the corresponding features for the final prediction.

Previous works in human facial AU recognition from still images usually employ regional learning. Zhao et al. [21] inserted a region layer into a classical CNN to learn the features of AUs on sparse facial regions, and trained the model via multi-label learning.

Table 1. Selected Action Units (Action Descriptors).

Code     AU (AD)               Labeled video clips
AD1      Eye white increase    394
AD19     Tongue show           443
AD38     Nostril dilator       696
AU101    Inner brow raiser     1918
AU145    Blink                 3856
AU25     Lips part             478
AU47     Half blink            1816
AU5      Upper lid raiser      208
AUH13    Nostril lift          353
EAD101   Ears forward          4778
EAD104   Ear rotator           5240

Shao et al. [17] further cascaded region layers with different patch sizes and employed an attention mechanism for AU detection.

3. Data

In total, the dataset used for this study contains 20180 labeled video clips across 31 AUs or ADs, with durations ranging from 0.05 seconds to 2 minutes. The data is recorded across eight horse subjects. We randomly sample one frame from each labeled clip to use as input for our classifier.
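As a concrete illustration of this sampling step, a minimal sketch is given below, assuming OpenCV for video decoding; the function name and the clip-interval bookkeeping are hypothetical, since we do not describe the implementation in detail here.

```python
import random
import cv2  # OpenCV, assumed here for video decoding

def sample_frame(video_path: str, start_s: float, end_s: float, seed: int = 0):
    """Return one randomly chosen frame from the labelled interval [start_s, end_s)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    first = int(start_s * fps)
    last = max(int(end_s * fps) - 1, first)
    frame_idx = random.Random(seed).randint(first, last)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```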

The class distribution is quite uneven. There are, e.g., 5280 labeled samples for EAD104 (ear rotator), but only one for AD160 (lower lip relax). For our experiments, we selected the 11 categories listed in Table 1. Each contains more than 200 labeled samples, which we consider to be the minimal sufficient number for the training, validation, and test sets. However, we quickly found that the ear action descriptors were not suited to detection from still images, since they are defined by movement. For this reason, we chose to exclude EAD101 and EAD104 from our experiments.

We perform subject-exclusive eight-fold validation across the different horses, using six for training, one for validation and one for testing in each fold.
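The following sketch illustrates one possible implementation of this subject-exclusive split; the rotation of the validation subject relative to the test subject across folds is an assumption for illustration, as only the six/one/one division is specified above.

```python
from typing import Dict, List

def subject_exclusive_folds(samples: List[dict],
                            subjects: List[str]) -> List[Dict[str, List[dict]]]:
    """One fold per horse: that horse is the test subject, the next one in the
    list is the validation subject, and the remaining six are used for training."""
    folds = []
    for i, test_subject in enumerate(subjects):
        val_subject = subjects[(i + 1) % len(subjects)]
        fold = {"train": [], "val": [], "test": []}
        for sample in samples:
            if sample["subject"] == test_subject:
                fold["test"].append(sample)
            elif sample["subject"] == val_subject:
                fold["val"].append(sample)
            else:
                fold["train"].append(sample)
        folds.append(fold)
    return folds
```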

As for the sampled images, the original sizes are 1910 × 1080 or 1272 × 720. For the face, eye and lower face crops, we first zero-pad the detected regions and then resize them. Face crops are resized to 512 × 512, as they are then fed into YOLOv3-tiny, whose default input size is 416 × 416. Eye and lower face crops are resized to 64 × 64 for the modified DRML and modified AlexNet classifiers, which can run on smaller input sizes.
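One possible implementation of this preprocessing is sketched below; centring the crop on the zero-padded square canvas is an assumption for illustration, as only the padding and target sizes are specified above.

```python
import cv2
import numpy as np

def pad_and_resize(crop: np.ndarray, size: int) -> np.ndarray:
    """Zero-pad a detected crop to a square canvas, then resize it to size x size."""
    h, w = crop.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side) + crop.shape[2:], dtype=crop.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = crop
    return cv2.resize(canvas, (size, size))

# face = pad_and_resize(face_crop, 512)   # face crops for the region detectors
# eye = pad_and_resize(eye_crop, 64)      # eye / lower-face crops for the classifiers
```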

4. Methodology

Considering the class imbalance, the dataset is not suited for multi-label classification. Initial tests were carried out in this fashion, but the model would get stuck in a local minimum where it predicted the dominant AUs to be true and the others to be false.


Figure 1. A raw example of AD1 (eye white increase) in our dataset.

Figure 2. Our cascade framework for EquiFACS recognition. Note that each part is trained separately.

Therefore, we use multiple binary classifiers for the nine classes. For each binary classifier set-up, we randomly sample negative examples so that the numbers of positive and negative samples are equal. This is done for the training, validation, and test splits of the data.
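The balanced sampling can be illustrated with the following sketch; the sample dictionaries and label sets are hypothetical placeholders for the actual data structures.

```python
import random

def balanced_binary_set(samples, au_code, seed=0):
    """Return all positives for one AU together with an equal number of
    randomly drawn negatives, as used for each binary classifier split."""
    rng = random.Random(seed)
    positives = [s for s in samples if au_code in s["labels"]]
    negatives = [s for s in samples if au_code not in s["labels"]]
    negatives = rng.sample(negatives, k=min(len(positives), len(negatives)))
    return positives, negatives

# applied separately to the training, validation and test splits, e.g.:
# train_pos, train_neg = balanced_binary_set(train_samples, "AU101")
```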

Further, binary classification of facial AUs is a highly fine-grained image classification task. As such, directly applying networks for common image classification tasks fails to reach acceptable results. Noticing that the horse face usually occupies only a fraction of the raw frame (Fig. 1), and inspired by the framework for sheep pain estimation by Lu et al. [12], we propose our Deep Learning cascade framework (Fig. 2) for horse facial AU recognition. For each input image, we first detect the horse face and crop it out. Then, depending on the facial location of the action unit class, we extract either the eye region or the lower face region (including nostrils and mouth) from the detected face region. This is because the eye and lower face regions naturally take up even smaller fractions of the raw frames, and the detector network is not able to detect these regions directly. Finally, two CNN-based models for image classification are used as binary classifiers for the respective classes belonging to these regions. Note that the classifiers are trained separately for each class.
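At inference time, the cascade in Fig. 2 can be summarised by the sketch below; the detector and classifier callables stand in for the trained YOLOv3-tiny detectors and CNN classifiers described in Sections 4.1 and 4.2, and the return conventions are assumptions for illustration.

```python
def recognize_aus(frame, face_detector, region_detector, classifiers):
    """Cascade inference: raw frame -> face crop -> region crop (eye or lower face)
    -> one binary prediction per AU belonging to that region."""
    face = face_detector(frame)        # returns a face crop, or None if no face is found
    if face is None:
        return {}
    region = region_detector(face)     # eye or lower-face detector run on the face crop
    if region is None:
        return {}
    return {au: classifier(region) for au, classifier in classifiers.items()}
```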

Table 2. Binary classification results for AU101, preceded by region detection of different precision: whole frame, head region, or, most precisely, eye region. Mean and standard deviation resulting from eight-fold cross-validation.

Region   DRML Accuracy   DRML F1-score   AlexNet Accuracy   AlexNet F1-score
Frame    54.0±7.1        46.9±10.9       52.8±5.8           46.0±11.6
Face     53.7±6.0        47.6±12.9       53.6±4.3           51.0±13.5
Eye      58.1±4.8        60.7±6.9        57.0±6.5           58.0±10.7

4.1. ROI Detector

YOLOv3 [14] is a widely used object detector with high performance with respect to both average precision and computation speed. For our task, we chose one of its light-weight implementations, YOLOv3-tiny, for the ROI detection, since it could be readily applied to our dataset with acceptable performance. Knowing that existing object detectors do not perform well on objects that are small relative to the frame, we chose to cascade the detectors together to detect small regions. Specifically, we first trained a face detector, and then trained an eye region detector and a lower face detector, respectively, on the detected face crops.
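Only the post-processing of the detector output is sketched here; the (x1, y1, x2, y2, score) box format and the choice of the highest-scoring box are assumptions for illustration.

```python
import numpy as np

def crop_best_box(image: np.ndarray, boxes):
    """Crop the highest-confidence detection from an image; boxes are
    (x1, y1, x2, y2, score) tuples in pixel coordinates."""
    if not boxes:
        return None
    x1, y1, x2, y2, _ = max(boxes, key=lambda b: b[4])
    return image[int(y1):int(y2), int(x1):int(x2)]
```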

4.2. Classifier

DRML [21] is a classical model for human facial AU detection. We first experimented with directly training it on raw frames as well as on detected face regions. Then, we applied it to detected eye or lower face regions, while replacing the region layer with a simple 3 × 3 convolution layer. In addition, we also applied AlexNet [10] as a classifier at all three levels of detail. When using DRML and AlexNet on crops of the eye or lower face regions (resizing the input to a resolution of 64 × 64), we modified the first convolutional layer in each model to use 5 × 5 convolutional kernels (instead of 11 × 11 as in the original models).
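As an illustration, the modified AlexNet could be instantiated as in the PyTorch sketch below; the stride and padding of the replacement layer and the two-way output head are assumptions, as only the 5 × 5 kernel and the 64 × 64 input size are specified above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Replace the 11x11 first convolution with a 5x5 one for 64x64 crops,
# and use a two-way output for the binary AU decision.
model = models.alexnet(weights=None)
model.features[0] = nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 2)

with torch.no_grad():
    logits = model(torch.randn(1, 3, 64, 64))  # one eye / lower-face crop
print(logits.shape)  # torch.Size([1, 2])
```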

5. Experimental Results

5.1. Model Exploration on AU101

First, we explore which frameworks are suitable for horse facial AU recognition. We evaluate these on AU101 (inner brow raiser), because the key features of AU101 (the angular shape of the inner brows) are relatively easy to recognize in images and the class has more labeled samples than other relatively "easy" classes. Using eight-fold validation, we evaluate the performance of the DRML model and AlexNet on raw frames, detected face regions, and detected eye regions, in turn. Results are shown in Table 2.

For both DRML and AlexNet, we observe no large difference between classification on the raw frames and on face crops; in both cases the results are essentially at chance level. We further employed Grad-CAM [16] to visualize which portions of the images were critical for these classifiers (Fig. 3).
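For reference, a generic Grad-CAM implementation using forward and backward hooks is sketched below; it is a minimal version for illustration, not the exact code used in our experiments.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Weight the target layer's activations by the spatially averaged gradients
    of the class score, then ReLU and upsample to the input resolution."""
    store = {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0]))
    try:
        model.zero_grad()
        score = model(image.unsqueeze(0))[0, class_idx]
        score.backward()
        weights = store["grad"].mean(dim=(2, 3), keepdim=True)   # GAP over H, W
        cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear",
                            align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze().detach()
    finally:
        h1.remove()
        h2.remove()

# e.g. heat = grad_cam(model, crop_tensor, model.features[-1], class_idx=1)
```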


Figure 3. Grad-CAM saliency maps of the models for binary classification of AU101.

Table 3. Binary classification for eight other AUs. Eye region detection is used for AD1, AU145, AU47, and AU5, while lower face region detection is used for AU25, AD19, AD38, and AUH13.

AU       DRML Accuracy   DRML F1-score   AlexNet Accuracy   AlexNet F1-score
AD1      65.2±6.0        64.0±10.4       64.1±6.2           60.8±8.2
AU145    57.5±4.2        57.8±5.1        59.9±2.4           57.2±7.9
AU47     49.6±3.7        50.7±9.1        49.4±2.3           47.9±8.3
AU5      60.5±8.1        57.6±8.9        57.9±8.8           56.9±12.6
AU25     59.8±6.7        57.8±9.6        63.6±8.2           57.9±12.6
AD19     64.6±5.0        59.6±8.9        61.8±5.0           58.0±8.1
AD38     58.5±4.1        57.3±6.4        60.9±7.1           57.7±10.1
AUH13    58.6±2.7        53.2±6.4        60.0±4.3           56.1±9.1

According to the visualization results, both classifiers failed to focus on the relevant regions, i.e., the inner brows, both for the raw frames and the face crops.

Our hope was that if we forced the classifiers to focus only on the eye regions, they could learn something meaningful for the recognition of AU101. The last two columns in Fig. 3 show that although the classifiers sometimes still look everywhere in the eye regions, they are able to attend to the inner brows themselves in some cases. Based on these results, we believe that for the task at hand, it is critical to give pre-defined ROIs as input to the classifiers.

5.2. Model Validation on Eight Other AUs

Based on the experiments on AU101, we carried out experiments on eight other AUs in the eye and lower face regions to validate our framework. The results are shown in Table 3.

In these experiments, the difference between the performance of DRML and AlexNet is generally not large, but DRML typically showed more stable performance across the different subject folds than AlexNet. For most AUs, the results lie close to those on AU101, except for AU47 (half blink). Moreover, AU47 is sometimes confused with AU145 (blink). We believe that this is because the difference between the presence and absence of AU47 is too small in still images. Our framework would need to be extended to take sequences of images as input to detect it, as in e.g. [1].

Figure 4. Grad-CAM saliency maps of the different models for binary classification of the other eight AUs. Examples with precise focus are shown in the even columns.

Similarly, we note that, theoretically, AU145 (blink) cannot be distinguished from AU143 (eye closure) in still images, because the sole difference between them is the duration for which the eyes remain closed. However, since the AU143 class has too few samples in our dataset, we did not include it in our experiments. The relatively good result for AU145 should therefore be interpreted in light of this bias in our dataset.

To further validate our models, we visualized their saliency maps, shown in Fig. 4. Similar to AU101, the classifiers are in many cases able to pay attention to the correct regions, such as the eyelid, nostril, corner of the mouth, and tongue (the even columns in Fig. 4), if we crop out the pre-defined related regions before training for classification.

6. Conclusions

In this project, we proposed a framework for automated detection of equine facial AUs and action descriptors, and showed that our framework helps the classifiers focus on more relevant regions and improves classification accuracy.

There are many avenues to explore in the future. Firstly, because the dataset used in this article is quite small and unbalanced, deeper models such as VGG and ResNet cannot be trained well, and multi-label learning is not suitable. These techniques will be explored when we collect enough data. Secondly, we are aware that the attention of our model is not fully stable, and we would like to add an attention mechanism to the classification models to make our framework more effective. Finally, our framework currently does not work well for the EADs. This is probably due to the many possible positions of the ears, which are extremely mobile and rarely still in horses. EADs are therefore probably best determined from video. This is also the case for the blinking AUs (AU47 and AU145). A future direction is therefore to extend the method to the temporal domain.

Acknowledgments

The authors would like to thank Elin Hernlund, Katrina Ask, and Maheen Rashid for valuable discussions. This work has been funded by Vetenskapsrådet and FORMAS.


References

[1] S. Broomé, K. B. Gleerup, P. H. Andersen, and H. Kjellström. Dynamics are important for the recognition of equine pain in video. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[2] J. F. Cohn, Z. Ambadar, and P. Ekman. Observer-based measurement of facial expression with the Facial Action Coding System. The Handbook of Emotion Elicitation and Assessment, 1(3):203–221, 2007.

[3] R. Diogo, B. A. Wood, M. A. Aziz, and A. Burrows. On the origin, homologies and evolution of primate facial muscles, with a particular focus on hominoids and a suggested unifying nomenclature for the facial muscles of the Mammalia. Journal of Anatomy, 215, 2009.

[4] P. Ekman. Facial expression and emotion. American Psychologist, 48(4):384–392, 1993.

[5] P. Ekman and W. V. Friesen. Facial Action Coding System. Consulting Psychologists Press, 1978.

[6] P. Ekman, W. V. Friesen, and J. C. Hager. Facial Action Coding System [E-book]. Research Nexus, 2002.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] H. I. Hummel, F. Pessanha, A. A. Salah, T. van Loon, and R. C. Veltkamp. Automatic pain detection on horse and donkey faces. In IEEE International Conference on Automatic Face and Gesture Recognition, 2020.

[9] K. A. Descovich, J. Wathan, M. C. Leach, H. M. Buchanan-Smith, P. Flecknell, et al. Facial expression: An underutilized tool for the assessment of welfare in mammals. ALTEX, 34, 2017.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.

[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.

[12] Y. Lu, M. Mahmoud, and P. Robinson. Estimating sheep pain level using facial action unit detection. In IEEE International Conference on Automatic Face and Gesture Recognition, 2017.

[13] M. Rashid, K. B. Gleerup, A. Silventoinen, and P. H. Andersen. Equine Facial Action Coding System for determination of pain-related facial responses in videos of horses. PLOS ONE, accepted, 2020.

[14] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems, 2015.

[16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2), 2019.

[17] Z. Shao, Z. Liu, J. Cai, Y. Wu, and L. Ma. Facial action unit detection using attention and relation learning. IEEE Transactions on Affective Computing, 2019.

[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[19] J. Wathan, A. M. Burrows, B. M. Waller, and K. McComb. EquiFACS: The Equine Facial Action Coding System. PLOS ONE, 10(8), 2015.

[20] A. C. Williams. Facial expression of pain: An evolutionary account. Behavioral and Brain Sciences, 25, 2002.

[21] K. Zhao, W.-S. Chu, and H. Zhang. Deep region and multi-label learning for facial action unit detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
