
IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 7, NO. 3, SEPTEMBER 2015 189

Emotion Recognition with the Help of Privileged Information

Shangfei Wang, Member, IEEE, Yachen Zhu, Lihua Yue, and Qiang Ji, Fellow, IEEE

Abstract—In this article, we propose a novel approach to recognize emotions with the help of privileged information, which is available only during training but not during testing. Such additional information can be exploited during training to construct a better classifier. Specifically, we recognize the audience's emotion from electroencephalogram (EEG) signals with the help of the stimulus videos, and tag videos' emotions with the aid of EEG signals. First, frequency features are extracted from the EEG signals and audio/visual features are extracted from the video stimuli. Second, features are selected by statistical tests. Third, a new EEG feature space and a new video feature space are constructed simultaneously using canonical correlation analysis (CCA). Finally, two support vector machines (SVM) are trained on the new EEG and video feature spaces, respectively. During emotion recognition from EEG, only EEG signals are available, and the SVM classifier obtained on the EEG feature space is used; for video emotion tagging, only video clips are available, and the SVM classifier constructed on the video feature space is adopted. Experiments on EEG-based emotion recognition and emotion video tagging are conducted on three benchmark databases, demonstrating that video content, as the context, can improve emotion recognition from EEG signals, and that EEG signals available during training can enhance emotion video tagging.

Index Terms—Canonical correlation analysis (CCA), electroencephalogram (EEG), emotion recognition, multimodal, privileged information, video tagging, videos.

I. INTRODUCTION

THE ability to understand human emotions is desirable for human-computer interaction [1]. In the past decade, a considerable amount of research has been done on emotion recognition from voice, visual behavior, and physiological signals, separately or jointly. Very good progress has been achieved in this field, and several commercial products have been developed, such as smile detection in cameras [2]. Emotion is a subjective response to an outer stimulus.

Manuscript received October 17, 2014; revised February 02, 2015; accepted July 27, 2015. Date of publication July 30, 2015; date of current version November 04, 2015. This article was supported by the National Natural Science Foundation of China (Grant Nos. 61175037, 61228304, 61473270) and the Natural Science Foundation of Anhui Province (1508085SMF223).
S. Wang, L. Yue, and Y. Zhu are with the School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]; [email protected]; [email protected]).
Q. Ji is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180-3590 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TAMD.2015.2463113

Most current research only employs the user's response, which is characterized primarily by verbal and nonverbal behavior, such as speech, facial expressions, and physiological signals. Although the stimulus is one of the key factors in a human's emotional experience [3], little research explores the relation between stimuli and users' responses to help emotion recognition. From the perspective of emotion recognition, the user's emotional response, such as the electroencephalogram (EEG), is available during both training and testing, and is referred to as observed information. Features extracted from the stimulus are often available during training but not during testing. This kind of information is called privileged information [4]; it may be exploited to find a better feature space or to construct a better classifier for the observed information, and therefore improves the recognition performance obtained from the observed information alone. In this article, we recognize emotion from EEG signals, which are collected while subjects watch emotion-inducing videos. The stimulus videos are exploited as privileged information to help improve EEG-based emotion recognition.

Recent years have seen a rapid increase in the size of video collections. Automatic video content analysis and annotation are therefore needed to organize video collections effectively and to help users find videos quickly. Since emotion is an important component in the human classification and retrieval of digital videos, assigning emotional tags to videos has been an active research area in recent decades [5]. Current emotional video content analysis can be divided into two categories: direct and implicit approaches. Direct emotional video content analysis infers the emotional content of videos directly from the related audiovisual features, while implicit emotional video content analysis detects emotional content based on an automatic analysis of a user's spontaneous response while consuming the videos. Little research considers the relations between stimulus videos and users' spontaneous responses, which may help video emotion tagging.
From the perspective of video emotion tagging, the video content is available during both training and testing. The audience's response, such as EEG signals, is often available during training but not during testing. Therefore, in this article, we propose to perform video emotion tagging with the aid of EEG signals, which are collected while subjects watch the videos.
In this article, we propose a novel approach to classify emotions from EEG signals and a novel method to tag videos' emotions. Instead of explicitly fusing multimodal features, as is often done by existing methods, we propose to implicitly integrate them by using one modality as privileged information in the training phase to help the emotion recognition by the other


modality, which is referred to as Learning Using Privileged Information (LUPI) [4].
The definition of LUPI is as follows: given a set of triplets $(x_1, x_1^*, y_1), \ldots, (x_l, x_l^*, y_l)$ generated from a fixed but unknown probability distribution $P(x, x^*, y)$, the vector $x$ is referred to as the observed information (i.e., the features), $y$ is the target label, and $x^*$ is called the privileged information, which is available only during training. The goal is to find, from a set of functions $\{f(x, \alpha), \alpha \in \Lambda\}$, the parameters $\alpha^*$ of a function $f(x, \alpha^*)$ that leads to the smallest probability of incorrect classification, with the help of $x^*$. In our method, we first extract five frequency features from each channel of the EEG signals and several audio/visual features from the video stimuli. Second, statistical tests are conducted to explore the relations between emotional tags and EEG/video features. Third, a new EEG feature space and a new video feature space are constructed simultaneously using canonical correlation analysis (CCA). Finally, two support vector machines (SVM) are trained on the EEG and video feature spaces for EEG-based emotion recognition and video emotion tagging, respectively. Experimental results on three benchmark databases demonstrate that, with the help of the stimulus videos, which are only available during training, our EEG-based emotion recognition approach outperforms the method using EEG signals only. Similarly, with the aid of EEG signals during model training, our video emotion tagging method is better than the one using video content only in most cases.
The outline of this article is organized as follows.

Section II gives a brief survey of emotion recognition and video tagging. In Section III, we present the framework of our approach and introduce the method used in each step. Section IV presents the experimental results on the datasets from [6], MAHNOB-HCI, and DEAP [7], as well as the analyses and a comparison with current work. Section V gives the conclusion.

II. RELATED WORK

A. Emotion Recognition

Humans interpret others' emotional states from multiple modalities, such as facial expressions, speech, behavior, and posture. Thus, current research on emotion recognition also adopts multiple modalities, including face, speech, and physiological signals.

Among these modalities, emotion recognition from face and speech has been studied most extensively, since it is much easier and cheaper to record users' face images and speech than physiological signals. However, facial expressions and the patterns embodied in speech do not always reflect users' true emotions, especially when users want to conceal their feelings. Physiological signals, in contrast, reflect unconscious changes in bodily functions, which are controlled by the sympathetic nervous system (SNS). Thus, physiological signals have recently been introduced to the field of emotion recognition. Early research mainly recognized deliberately displayed and exaggerated emotions from a single modality. Recently, an increasing number of works have focused on spontaneous emotion recognition from multiple modalities. For emotion detection from a single modality, current emotion recognition research is formulated as a supervised learning problem. Thus, most works have focused on finding good features and good classifiers to learn the mapping from features to labels. Detailed surveys of emotion recognition can be found in [8], [9], [10].

For multimodal emotion recognition, current works have adopted both feature-level fusion [11], [12], [13], [14], [15] and decision-level fusion [16], [13], [14], [17], [18], [19], [20], [15]. The combined modalities can be face-body [19], face-physiological signal [21], face-speech [22], speech-physiological signal [23], multichannel physiological signals [24], [25], [26], [27], [28], [29], [30], [31], [32], face-voice-body [33], and speech-text [34]. Most reported work achieved better performance with feature-level fusion [35]. A comprehensive overview can be found in [36], [37], [38]. We refer to decision-level and feature-level fusion as explicit fusion, which uses multiple modalities during both training and testing.

Although present emotion recognition has achieved great progress, it only employs users' emotional responses, ignoring the emotion stimuli, which are closely related to the induced emotional response and the detected emotion labels. The stimuli may provide further information for emotion recognition from either one modality or multiple modalities. For example, for analyzing the emotions from video-induced EEG signals, the stimulus video could provide important supplementary information for building a video-based context model for emotion recognition.

B. Video Emotional Tagging

To the best of our knowledge, the study of emotional tagging of videos was first conducted at the beginning of the last decade by Chitra Dorai [39], who defined the concept of computational media aesthetics (CMA) as the algorithmic study of analyzing and interpreting audiences' emotional responses to media from the media's visual and aural elements based on film grammar. The core trait of CMA is to tag media with the expected emotions, i.e., the emotions a film-maker intends to communicate to particular audiences with a common cultural background. Hanjalic [40] was the first to successfully relate audiovisual features to the expected emotional dimensions of the audience. Following this philosophy, earlier research tried to infer the emotional content of videos directly from the related audiovisual features. This kind of research is called the direct approach of emotional video content analysis.

In addition to the visual and aural elements in videos, users' spontaneous nonverbal responses while consuming the videos provide clues to users' actual emotions as induced by the videos, and therefore provide an indirect characterization of a video's emotional content. Recently, there has been research relating users' spontaneous nonverbal responses to videos' emotion tags. This is called implicit emotional video content analysis, which infers a video's emotional content indirectly based on an analysis of users' spontaneous reactions while watching the video. A comprehensive survey of video emotion tagging can be found in [41].


Fig. 1. Framework of recognizing emotions with privileged information by exploiting relations between video content and users’ responses.

Thus, emotional tagging of videos involves the videos' content, users' spontaneous responses, and users' subjective evaluations (that is, the videos' emotional tags). However, few researchers have fully explored the relationships among the three [6]. Soleymani et al. [42] analyzed the relationship between subjects' physiological responses and videos' emotional tags in both arousal and valence, as well as the relationship between the content of videos and the emotional tags in arousal and valence. Wang et al. [6] proposed hybrid approaches to annotate videos in valence and arousal space by using both users' EEG signals and video content. Although the induced physiological signals are available during training, it is not convenient to collect users' physiological signals during the application of emotion tagging. Therefore, in this article, we propose to use EEG signals as the privileged information, which is only needed during model construction.

Compared with related works, our contributions are as follows:

• We propose a novel framework to recognize emotions from EEG and assign emotion tags to videos by building a new feature space for one modality with the assistance of another modality.

• To the best of our knowledge, this is the first time video features are used as privileged information for emotion recognition from EEG [43], and EEG features are used as privileged information for video emotion tagging. We regard the proposed method as implicit fusion, which uses multiple modalities only during training, not during testing.

III. METHODOLOGY

Fig. 1 shows the framework of our approach. Each color represents the data flow of the corresponding modality. There are two phases in the framework: learning and testing. In the learning phase, there are four steps:
1) EEG and video features are extracted.
2) Features are selected by statistical tests to check whether there exists a significant difference in every feature between two groups of emotions or tags.
3) The relationship between EEG signals and video content is exploited using CCA to co-construct a new EEG feature space and a new video feature space.
4) For emotion recognition from EEG signals, an SVM classifier is trained on the new EEG feature space; for video emotion tagging, an SVM classifier is learned on the new video feature space.
In the testing phase, for emotion recognition from EEG signals, the extracted EEG features are mapped into the new EEG feature space, and the SVM classifier trained on that space is used to recognize emotions; for video tagging, video features are mapped into the new video feature space, and the SVM classifier learned on that space is adopted to assign emotion tags to videos.

A. Feature Extraction

1) EEG Features: First, noise mitigation is carried out. Horizontal electrooculography (HEOG) and vertical electrooculography (VEOG) channels are removed, and a Butterworth bandpass filter with a lower cutoff frequency of 0.3 Hz and a higher cutoff frequency of 45 Hz is used to remove DC drifts and suppress the 50 Hz power line interference [44], [45]. Then the power spectrum (PS) is calculated and divided into five segments [46], corresponding to the delta (0-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), and gamma (30-45 Hz) frequency bands. The ratio of the power in each frequency band to the overall power is extracted as the feature.

2) Visual-Audio Features: We extract both visual and audio features from the videos. For visual features, lighting, color, and motion are powerful tools to establish the mood of a scene and affect the emotions of the viewer, according to cinematography and psychology. Thus, three features, namely lighting key, color energy, and visual excitement, are extracted from the video clips. The details of these features can be found in [47]. For audio features, thirty-one commonly used audio features are extracted from the videos, including average energy, average energy intensity, spectrum flux, zero crossing rate (ZCR), standard deviation of ZCR, 12 Mel-frequency cepstral coefficients (MFCCs), log energy of MFCC, and the standard deviations of the above 13 MFCCs [48], using PRAAT (V5.2.35) [49].
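To make the EEG feature computation concrete, the following sketch reproduces the band-power-ratio features described above using SciPy. The sampling rate, filter order, and array layout are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of the EEG band-power-ratio features (delta, theta, alpha, beta, gamma
# power divided by total power). Filter order and segment length are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, welch

BANDS = {"delta": (0.3, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def bandpass(signal, fs, low=0.3, high=45.0, order=4):
    """Butterworth band-pass filter removing DC drift and 50 Hz interference."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)

def band_power_ratios(eeg, fs):
    """eeg: array of shape (n_channels, n_samples) -> 5 ratio features per channel."""
    feats = []
    for channel in eeg:
        filtered = bandpass(channel, fs)
        freqs, psd = welch(filtered, fs=fs, nperseg=min(len(filtered), 4 * fs))
        total = np.trapz(psd, freqs)
        for low, high in BANDS.values():
            mask = (freqs >= low) & (freqs < high)
            feats.append(np.trapz(psd[mask], freqs[mask]) / total)
    return np.asarray(feats)
```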

B. Feature Selection

We select features by exploring the relations between the EEG/video features and users' emotions with the approach in [6]. After feature extraction, we conduct statistical hypothesis testing to analyze whether there exists a significant difference in every feature between the two groups of emotional tags. The null hypothesis H0 is that the median difference between positive and negative valence (or high and low arousal) for a feature is zero. The alternative hypothesis H1 is that the median difference between positive and negative valence (or high and low arousal) for a feature is not zero. We may reject the null hypothesis when the p-value is less than the significance level. The procedure is as follows: first, a normality test is performed on each feature. If the feature is not normally distributed, the Kolmogorov-Smirnov test is used. Otherwise, the homogeneity of variance of the feature is tested. If the variance is homogeneous, a t-test assuming equal variances is performed; otherwise, a t-test assuming unequal variances is performed. Features with p-values lower than a specific threshold are selected. In our study, the p-value threshold is set to 0.05 to capture more features that could represent the nature of the EEG signals or video content.
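The selection cascade just described can be sketched with SciPy's hypothesis tests as follows. The specific normality and homogeneity-of-variance tests (Shapiro-Wilk and Levene) are assumptions chosen for illustration, since the paper does not name them.

```python
# Sketch of the feature-selection cascade: normality check, then either a
# two-sample KS test or a t-test with/without the equal-variance assumption.
import numpy as np
from scipy import stats

def select_features(X, y, alpha=0.05):
    """X: (n_samples, n_features) array; y: binary labels (e.g., high/low arousal).
    Returns indices of features whose two groups differ significantly."""
    selected = []
    g0, g1 = X[y == 0], X[y == 1]
    for j in range(X.shape[1]):
        a, b = g0[:, j], g1[:, j]
        normal = (stats.shapiro(a).pvalue > alpha and
                  stats.shapiro(b).pvalue > alpha)
        if not normal:
            # Non-normal feature: two-sample Kolmogorov-Smirnov test.
            p = stats.ks_2samp(a, b).pvalue
        else:
            # Normal feature: check variance homogeneity, then t-test.
            equal_var = stats.levene(a, b).pvalue > alpha
            p = stats.ttest_ind(a, b, equal_var=equal_var).pvalue
        if p < alpha:
            selected.append(j)
    return np.array(selected)
```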

C. Joint Construction of New EEG and Video Feature Spaces

After feature selection, we construct a new EEG feature space with the help of the video content and a new video feature space with the assistance of the EEG signals, using CCA. Instead of constructing a new feature space independently for each modality (EEG or video), we propose to construct a new feature space for each modality with the help of the other modality by exploiting their correlations. We believe such a constructed feature space outperforms a feature space constructed independently for each modality. This idea is the essence of the proposed emotion analysis with privileged information.

The main idea of CCA is to assess the relationship between a set of independent variables and multiple dependent measures. In our approach, denote the original training features of the videos and the EEG signals by $x$ and $y$, respectively. The goal of CCA is to find projection vectors $w_x$ and $w_y$ and use them to construct the canonical components $u$ and $v$:

$u = w_x^T x$  (1)
$v = w_y^T y$  (2)

where $u$ and $v$ are the corresponding canonical components, which have the highest Pearson correlation coefficient, and the objective function is

$\rho = \max_{w_x, w_y} \dfrac{E[uv]}{\sqrt{E[u^2]\,E[v^2]}} = \max_{w_x, w_y} \dfrac{w_x^T C_{xy} w_y}{\sqrt{(w_x^T C_{xx} w_x)(w_y^T C_{yy} w_y)}}$  (3)

where $E[\cdot]$ is the empirical expectation and $C_{xx}$, $C_{yy}$, $C_{xy}$ are the within-set and between-set covariance matrices. Here we use the approach described in [50] to solve this problem. The relationship between $w_x$, $w_y$, and the covariance matrices is as follows:

$C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda w_x$  (4)
$C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda w_y$  (5)

where $C_{yx} = C_{xy}^T$ and $\lambda = \rho^2$. Thus, $\lambda$ can be solved as the largest eigenvalue of $C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}$, and $w_x$ is the corresponding eigenvector. In a similar way, $w_y$ is the eigenvector of $C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}$. With this calculation, the relationship between $x$ and $y$ is encoded in $w_x$ and $w_y$, which can be used for training and testing.

Moreover, the relationship between the two projections can also be described as follows:

$w_x = \dfrac{1}{\sqrt{\lambda}} C_{xx}^{-1} C_{xy} w_y$  (6)

$w_y = \dfrac{1}{\sqrt{\lambda}} C_{yy}^{-1} C_{yx} w_x$  (7)

After obtaining this relation, we adopt it in the training and testing phases. For emotion recognition from EEG, where the video features are used as privileged information during the training phase, the canonical component $v$ of the EEG features is employed for training the model. During testing, only the test EEG features $y_{test}$ are available and no video features are available; they are mapped into the video-assisted EEG feature space by

$v_{test} = w_y^T y_{test}$  (8)

Similar procedures are used in video tagging. When the EEG features are used as privileged information, $u$ is used for training, and the video features are transformed before testing by the following equation:

$u_{test} = w_x^T x_{test}$  (9)
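As a concrete illustration of (1)-(9), the sketch below solves the CCA eigenproblem with NumPy and then projects held-out features with the learned directions. The ridge term added to the covariance matrices and the default number of retained components are assumptions for numerical stability and illustration, not choices stated in the paper.

```python
# Sketch of the CCA construction: solve eqs. (4)-(5) for the video/EEG
# projections Wx, Wy, then map test-time features by eqs. (8)-(9).
import numpy as np

def fit_cca(X_video, Y_eeg, n_components=None, ridge=1e-6):
    X = X_video - X_video.mean(axis=0)
    Y = Y_eeg - Y_eeg.mean(axis=0)
    n = X.shape[0]
    if n_components is None:
        n_components = min(X.shape[1], Y.shape[1])
    Cxx = X.T @ X / n + ridge * np.eye(X.shape[1])   # within-set covariances
    Cyy = Y.T @ Y / n + ridge * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n                                # between-set covariance
    # Eq. (4): Cxx^{-1} Cxy Cyy^{-1} Cyx wx = lambda wx
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:n_components]
    lam = np.maximum(eigvals.real[order], 1e-12)
    Wx = eigvecs.real[:, order]                      # video projections (columns)
    # Eq. (7): wy proportional to Cyy^{-1} Cyx wx, scaled by 1/sqrt(lambda)
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx) / np.sqrt(lam)
    return Wx, Wy, X_video.mean(axis=0), Y_eeg.mean(axis=0)

def project_eeg(Y_eeg_test, Wy, eeg_mean):
    """Eq. (8): map test-time EEG features into the new (video-assisted) space."""
    return (Y_eeg_test - eeg_mean) @ Wy

def project_video(X_video_test, Wx, video_mean):
    """Eq. (9): map test-time video features into the new video space."""
    return (X_video_test - video_mean) @ Wx
```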

D. Classifiers for Emotion Analyses

An SVM classifier is employed as follows:

$f(z) = \operatorname{sign}\Big(\sum_{i} \alpha_i t_i K(z_i, z) + b\Big)$  (10)

where the input $z$ is $v$ for emotion recognition and $u$ for video tagging, $t_i$ are the training labels, $\alpha_i$ and $b$ are the learned SVM parameters, and $K(\cdot,\cdot)$ is a kernel function. During training, both EEG signals and video stimuli are available. Therefore, we obtain the EEG feature space using $v$ of (2) and the video feature space using $u$ of (1). During testing, only one modality is available; we obtain the EEG feature space using $v_{test}$ in (8) for emotion recognition and the video feature space using $u_{test}$ in (9) for video tagging.
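A minimal sketch of this training/testing recipe, reusing the fit_cca and projection helpers from the previous sketch: the video modality enters only through the CCA fit (as privileged information), and the SVM sees EEG features alone at test time. scikit-learn's SVC with an RBF kernel is a stand-in; the paper does not specify the kernel here.

```python
# Sketch of EEG-based emotion recognition with video as privileged information.
from sklearn.svm import SVC

def train_eeg_emotion_classifier(X_video_train, Y_eeg_train, labels_train):
    # Video features are consumed only here, while fitting CCA.
    Wx, Wy, video_mean, eeg_mean = fit_cca(X_video_train, Y_eeg_train)
    V_train = project_eeg(Y_eeg_train, Wy, eeg_mean)     # new EEG space, eq. (2)
    clf = SVC(kernel="rbf").fit(V_train, labels_train)   # classifier of eq. (10)
    return clf, Wy, eeg_mean

def predict_emotion_from_eeg(clf, Wy, eeg_mean, Y_eeg_test):
    # Test time: EEG only, mapped by eq. (8); no video features are needed.
    return clf.predict(project_eeg(Y_eeg_test, Wy, eeg_mean))
```

The video-tagging direction is symmetric: train the SVM on project_video outputs and keep only Wx and the video mean for testing.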

IV. RESULTS

A. Dataset Preparation

To evaluate the performance of our approach, we conducted experiments on three benchmark databases: USTC-ERVS (EEG Response to Video Stimulus) [6], MAHNOB-HCI [51], and DEAP [7]. Each of them collects EEG signals from 32 channels placed according to the 10-20 system [52]. Since each dataset was collected under different conditions and with different emotional ratings, below we discuss the procedure we followed to prepare the datasets for this work.
Specifically, the USTC-ERVS dataset contains 197 EEG responses to 92 video stimuli from 28 users. Users' emotional self-assessments are five-scale evaluations (i.e., -2, -1, 0, 1, 2)


for both valence and arousal. For this work, we divide them into two groups, based on whether their self-assessed values are higher than zero or not. As a result, in the arousal space, 149 EEG recordings are classified as high and 48 as low; 70 videos are high and 22 are low. Similarly, in the valence space, 77 EEG recordings are classified as positive and 120 as negative; 30 video clips are positive and 62 are negative.
MAHNOB-HCI is a multimodal database for emotion recognition and implicit tagging. It includes the physiological signals of 27 participants in response to 20 stimulus videos. Subjects' emotional self-assessments are nine-scale evaluations, from 1 to 9, for both valence and arousal. For this work, we label the ratings as positive or high if they are larger than five, and as negative or low otherwise. Seven samples are removed because they do not have a corresponding media file or gaze data. Thus, we obtain 533 EEG segments corresponding to the 20 stimulus videos. For valence, there are 289 positive and 244 negative EEG segments, as well as seven positive and 13 negative videos. For arousal, there are 268 high and 265 low EEG segments, as well as 10 high and 10 low video stimuli.
The DEAP database is a multimodal database that contains EEG and peripheral physiological signals of 32 participants in response to 40 stimulus music videos. Subjects' emotional self-assessments are nine-scale evaluations, from 1 to 9, for both valence and arousal. For this work, we label the ratings as positive or high if they are larger than 5, and as negative or low otherwise. As a result, we obtain 1216 EEG segments corresponding to 38 stimulus videos. Two videos (experiment IDs 17 and 18) could not be downloaded from YouTube due to copyright issues. For valence, our classification produces 672 positive and 544 negative EEG segments, as well as 18 positive and 20 negative videos. For arousal, there are 726 high and 490 low EEG segments, as well as 25 high and 13 low video stimuli.
Because the numbers of users and videos differ across databases, experiments conducted on different databases involve different numbers of users and videos. In addition, we used all available samples in the databases in order to compare with related work. Finally, for the experiments, the ground-truth emotion labels of the EEG signals come from the subjects' evaluations, and the ground-truth emotion tags for the videos are the average of the corresponding annotations.
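A minimal sketch of this label preparation, assuming a table with one row per (subject, video) pair; the column names, and the thresholding of the per-video average rating at 5 to obtain a binary tag, are illustrative assumptions rather than details stated above.

```python
# Sketch of label preparation for the 9-point MAHNOB-HCI / DEAP ratings:
# per-trial ratings above 5 become high/positive; a video's tag is derived
# from the average of its ratings. DataFrame column names are hypothetical.
import pandas as pd

def prepare_labels(ratings: pd.DataFrame, dim: str = "valence"):
    """ratings: one row per (subject, video) pair with a 1-9 self-assessment in `dim`."""
    eeg_labels = (ratings[dim] > 5).astype(int)                              # per-EEG-segment label
    video_tags = (ratings.groupby("video_id")[dim].mean() > 5).astype(int)   # per-video tag
    return eeg_labels, video_tags
```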

B. Experimental Results and Analyses

To evaluate the proposed research, we conducted four experiments. First, we evaluated the effectiveness of the proposed feature selection method, whereby we identify the EEG channels and video features that are most effective for emotion recognition and video emotion tagging. Second, we evaluated the performance of the proposed method for emotion recognition from EEG signals against the baseline method. Third, we evaluated the performance of the proposed method for video emotion tagging against the baseline method. Finally, we compared our methods with state-of-the-art methods that perform explicit multimodal fusion. For all experiments, we employed the leave-one-video-out cross-validation protocol. During training, for each fold, a model selection procedure is carried out to tune the hyper-parameters of the best candidate model. Below, we present the results of these experiments.
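The leave-one-video-out protocol with per-fold model selection can be sketched as follows; GridSearchCV and the parameter grid are stand-ins for the unspecified model-selection procedure, not the authors' exact setup.

```python
# Sketch of leave-one-video-out cross-validation: all samples sharing a
# stimulus video form one held-out fold; hyper-parameters are tuned inside
# each training fold.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.svm import SVC

def leave_one_video_out(features, labels, video_ids):
    labels = np.asarray(labels)
    predictions = np.empty_like(labels)
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups=video_ids):
        search = GridSearchCV(SVC(),
                              {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                              cv=3)
        search.fit(features[train_idx], labels[train_idx])
        predictions[test_idx] = search.predict(features[test_idx])
    return predictions
```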

discuss the experimental results to identify effective EEG andvideo features.We first present results for EEG feature analysis. Fig. 2 shows

the frequency distribution on humans’ brain of the selected EEGfeatures in valence and arousal dimensions. The effectiveness ofan EEG electrode is measured by its EEG feature selection fre-quency. The EEG feature selection frequency of an electrode isthe ratio of the number of selected features on this electrode tothe number of all selected features for all folds of the cross-val-idation. Colors from red to blue correspond to frequencies fromhigh (20%) to low (0%). Fig. 2(a), Fig. 2(c) and Fig. 2(e) showthe distribution of the selected features in valence space. We ob-serve that the selected EEG features are mainly concentrated inthe frontal area of the brain, which suggests that the frontal areais highly correlated with the valence aspect of the humans’ emo-tion. Fig. 2(b), Fig. 2(d) and Fig. 2(f) illustrate the area that areactive for arousal response, from which we can conclude thatfeatures mainly gather in occipital area, which is highly rele-vant to the excitement (arousal) of humans’ emotions [53].For video features (audio and visual features), we also use

feature selection frequencies in all folds of cross-validation tomeasure the feature’s effectiveness. For example, if a feature isselected four times in a 10-fold cross-validation, its frequency is40%. Fig. 3 shows the frequencies of selected video features onall the datasets. Each horizontal coordinate represents a videofeature, and vertical coordinates represents corresponding se-lected frequencies. Fig. 3(a), Fig. 3(c) and Fig. 3(e) show theselected feature frequencies in valence space. For visual fea-tures, even though only three of them are extracted, two ofthem are selected, which suggests visual stimulus strongly in-fluence valence aspect of emotions [54]. For audio features,MFCC features are discriminative [54], since a majority of themare among the high selection frequencies features. For arousalspace, Fig. 3(b), Fig. 3(d) and Fig. 3(f) display the features thatare highly related to the excitement of emotion. It is clear thatthe frequencies of the last video feature -visual excitement, islower than the other two visual features, since it is not selectedon the USTC-ERVS and the DEAP, and its frequency on theMAHNOB-HCI is only about 50%. However, color energy andlighting key are both selected on the MAHNOB-HCI and theDEAPwith high frequencies.We can therefore conclude that thecolor contrast and brightness are significant in arousing human’semotions.In summary, through the feature evaluation experiments, we
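The two selection-frequency measures used above (fraction of folds for a video feature, share of all selected features for an EEG electrode) can be computed as in the following sketch; the input structures are hypothetical.

```python
# Sketch of the selection-frequency measures behind Figs. 2 and 3.
# `selected_per_fold` lists the selected feature indices for each CV fold;
# `electrode_of` maps an EEG feature index to its electrode name.
from collections import Counter

def feature_frequency(selected_per_fold):
    """Fraction of folds in which each feature is selected (e.g., 4/10 -> 40%)."""
    counts = Counter(idx for fold in selected_per_fold for idx in fold)
    n_folds = len(selected_per_fold)
    return {idx: c / n_folds for idx, c in counts.items()}

def electrode_frequency(selected_per_fold, electrode_of):
    """Share of all selected features (over all folds) lying on each electrode."""
    flat = [electrode_of[idx] for fold in selected_per_fold for idx in fold]
    total = len(flat)
    return {e: c / total for e, c in Counter(flat).items()}
```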

In summary, through the feature evaluation experiments, we can conclude that EEG channels located on the frontal part of the brain are effective in characterizing the valence of emotion, while EEG channels located on the occipital part of the brain are effective in characterizing the arousal aspect of emotion. These experiments further show that visual features such as color contrast and brightness are effective for characterizing both valence and arousal, while the MFCC audio features are effective for characterizing valence. These findings are consistent with prior studies [54].
2) Emotion Recognition Experiments: To evaluate the performance of our approach for emotion recognition, we compare emotion recognition from EEG signals only with emotion recognition from EEG signals aided by the video content as privileged information.


Fig. 2. Distribution of selected EEG features on different databases. (a) Valence on USTC-ERVS. (b) Arousal on USTC-ERVS. (c) Valence on MAHNOB-HCI. (d) Arousal on MAHNOB-HCI. (e) Valence on DEAP. (f) Arousal on DEAP.

The former recognizes emotions from the selected EEG features as described in Section III-A1, and the latter recognizes emotions from the new EEG feature space, which is constructed with the help of the video content using CCA, as discussed in Section III-C.
Table I summarizes the experimental results for emotion recognition from EEG signals. To evaluate the performance of our methods under different conditions and from different perspectives, we employ two performance criteria: recognition accuracy and F1-score. Recognition accuracy measures overall classification accuracy without considering the performance for each class, while the F1-score is a composite score that combines the classifier's performance for each class. For unbalanced test data, the F1-score is the better measure. From Table I, we can make the following observations:
1) Our proposed EEG-based emotion recognition approach using stimulus videos as privileged information outperforms the baseline method that uses only EEG signals for both valence and arousal recognition. Specifically, for valence recognition, our method outperforms the baseline method in terms of both recognition accuracy and F1-score on all three databases. For arousal recognition, the proposed approach improves the F1-score on all three databases and improves the recognition accuracy on one dataset.
2) Our proposed emotion recognition approach works better for valence recognition than for arousal recognition, since it improves both performance metrics on all three databases for valence recognition.
3) For arousal recognition, incorporating videos during training improves the recognition of low-arousal samples over the baseline model.


Fig. 3. Video feature selection frequencies on different databases. (a) Valence on USTC-ERVS. (b) Arousal on USTC-ERVS. (c) Valence on MAHNOB-HCI. (d) Arousal on MAHNOB-HCI. (e) Valence on DEAP. (f) Arousal on DEAP.

3) Emotion Tagging Experiments: To evaluate the performance of the proposed method for video emotion tagging, we follow the same experimental protocol as for emotion recognition. Specifically, we compare emotion tagging using video content only with emotion tagging using video content with the help of EEG signals. The former uses only video features, while the latter uses the video features identified with the help of EEG signals as the privileged information.
Table II shows the experimental results of video emotion tagging. As before, we employ both the recognition accuracy and F1-score criteria for the evaluation. From Table II, we find that, in terms of F1-score, our proposed approach using EEG signals as privileged information outperforms the baseline method that does not use EEG signals in four out of six cases, i.e., valence tagging on the MAHNOB-HCI and DEAP databases, as well as arousal tagging on the USTC-ERVS and MAHNOB-HCI databases. These results demonstrate the effectiveness of our method.
For valence tagging on the USTC-ERVS database and arousal tagging on the DEAP database, incorporating EEG signals during training actually hurts the video emotion tagging performance. We believe this is caused by large EEG response variation and large video emotion rating variation across subjects for the same videos. In other words, for certain videos, users' EEG responses and their ratings of the emotion of these videos vary significantly from user to user. These large between-subject variations produce inconsistent EEG signals, which lead to poor performance in video emotion tagging, since video emotion tags are, in general, person-independent.
To evaluate this hypothesis, we performed a comparative experiment on the DEAP database, since each video of the DEAP database was watched by the same 32 subjects. We first compared the subjects' EEG features and their evaluations of the worst and best tagged videos in arousal.


TABLE I
EMOTION RECOGNITION RESULTS FROM EEG

TABLE II
VIDEO TAGGING RESULTS FROM VIDEO CONTENTS

"Worst" means the video with the maximum misclassification error, while "best" means the video with the minimum misclassification error.
Fig. 4 compares the standard deviations of the selected EEG features over the 32 subjects while watching the best tagged video (i.e., video 2) and the worst tagged video (i.e., video 21) in arousal on the DEAP dataset, where the horizontal axis represents the selected features and the vertical axis represents their standard deviation values. It is clear that the standard deviations of most selected EEG features corresponding to video 2 (the best video) are lower than those corresponding to video 21 (the worst video). This suggests that the subjects' physiological responses to video 21 vary more from person to person than those to video 2. This subject variation will also lead to variation, or even contradiction, in the subjects' ratings of a video's emotional content.
Here we use the standard deviation of the subjects' EEG features because it represents the fluctuation of a feature among subjects. Since all the features lie in [0,1], a higher standard deviation means higher uncertainty or diversity among subjects. In contrast, mean feature values show the expectation of the feature values; higher or lower mean values cannot capture subject-specific information in the features.

Fig. 4. Standard deviations of selected EEG features for the best and worst tagged videos on arousal of DEAP.

To test whether there is a significant difference between the users' EEG patterns for the best tagged video and those for the worst tagged video, we performed a paired two-tailed t-test on the standard deviations of the EEG features selected for both videos.


Fig. 5. Users’ evaluations for the best and worst tagged videos on arousal ofDEAP.

A feature's standard deviations corresponding to the "worst" and "best" videos compose a test pair. The t-test is formulated as (11):

$t = \dfrac{\bar{d}}{s_d/\sqrt{n}}, \qquad d_i = \sigma_i^{w} - \sigma_i^{b}$  (11)

where $\sigma_i^{w}$ and $\sigma_i^{b}$ are the standard deviations of the $i$th EEG feature in the intersection of the features selected for the worst tagged and the best tagged videos, respectively, $\bar{d}$ and $s_d$ are the mean and standard deviation of the differences $d_i$, and $n$ is the number of feature pairs. A two-tailed test captures statistical significance in both directions of the distribution. The resulting p-value is less than 0.01, which means that the standard deviations of the EEG features are significantly different between the best tagged and the worst tagged videos.
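The paired two-tailed t-test of (11) corresponds to SciPy's ttest_rel applied to the per-feature standard deviations, as in the following sketch; the input arrays are placeholders.

```python
# Sketch of the paired two-tailed t-test in (11): compare per-feature standard
# deviations across subjects for the worst- and best-tagged videos.
from scipy.stats import ttest_rel

def compare_subject_variability(feats_worst, feats_best):
    """feats_*: arrays of shape (n_subjects, n_selected_features)."""
    sigma_worst = feats_worst.std(axis=0)   # sigma_i^w for each selected feature
    sigma_best = feats_best.std(axis=0)     # sigma_i^b
    t_stat, p_value = ttest_rel(sigma_worst, sigma_best)
    return t_stat, p_value                  # two-tailed p-value
```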

video’s emotional content also vary with user. Fig. 5 comparesmeans and standard deviations of users’ evaluations betweenvideo 21 and 2. It could be discovered that video 21’s averageevaluation is closer to the dividing value (i.e., 5) of high / lowarousal annotations than that of video 2. Since its standarddeviation is higher, which means subjects’ arousal feelingson video 21 is significantly individualized. Some of them feelaroused but others have the opposite feelings. Such variationmay prevent EEG from assisting emotion tagging in arousalspace on the DEAP database.For further analyses, we compared the corresponding EEG

features and subjects’ evaluations of the worst and best taggedvideos in valence on the DEAP dataset. Fig. 6 shows thecomparison of EEG feature between the worst and best taggedvideos in valence. We can find that the standard deviation ofmost selected EEG features correspond to video 6 (best video)are close to those correspond to video 10 (worst video). Thep-value of a paired two-tailed -test result is 0.29, which isgreater than 0.05. This means the difference between the EEGfeatures’ standard deviations induced by the two videos is notsignificant. It means that the subjects’ physiological response tovideo 10 and 6 has the same trend within their own EEG sam-ples. Fig. 7 shows means and standard deviations of subjects’

Fig. 6. Standard deviations of selected EEG features for the best and worst tagged videos on valence of DEAP.

Fig. 7. Users’ evaluations for the best and worst tagged videos on valence ofDEAP.

Compared with Fig. 5, it can be seen that the standard deviations for both the worst and the best videos are lower than the corresponding values in Fig. 5. This means that subjects' evaluations are more concordant on valence and that subjects' responses show less individuality than on arousal.
Based on the above analysis, we can conclude that if the privileged information, such as the EEG features, is subject-specific and context-dependent for a specific video, it will hurt the emotion tagging performance. Our approach is more effective when subjects' physiological responses and evaluations are consistent with each other.
4) Comparison with Related Works: Currently, little work exploits the relations between users' emotional responses and the stimuli for users' emotion recognition and video emotion tagging. Soleymani et al. [42] analyzed the relationship between subjects' physiological responses and videos' emotional tags in both arousal and valence. However, they did not use such relations for video tagging or emotion recognition. Wang et al. [6] proposed hybrid approaches to annotate videos in valence and arousal space by using both users' EEG signals and video content. Since they adopted the individual emotion evaluations as the label, we only compare their results with ours for emotion recognition, as shown in Table III.
Comparing Table III with Table I, we find that for valence recognition, our method performs slightly better than the dependent feature-level fusion, which also builds the relationship between EEG and video features.


TABLE III
EMOTION RECOGNITION RESULTS BY FUSING MODALITIES IN [6] ON USTC-ERVS

We obtain the same recognition accuracy but an F1-score that is 0.0082 higher. Even though the improvement is marginal, our method is still preferable in that it uses much less information (only EEG signals during testing) than the fusion methods, which use both EEG signals and video features during testing.
In addition, for arousal recognition, our method performs better than the decision-level fusion and the dependent feature-level fusion, with F1-scores that are 0.0140 and 0.0828 higher, respectively. The decision-level fusion and dependent feature-level fusion proposed in [6] use EEG signals and video content during both training and testing. In contrast, our method uses video during training only. This further demonstrates that our method can recognize emotion effectively by employing the stimulus video as privileged information.
We also tried to compare our work with the baseline results from [51] and [7]. However, due to different sample division strategies and missing videos, we cannot perform a fair comparison in this article. Specifically, for MAHNOB-HCI, [51] divided the valence/arousal space by emotion categories instead of by self-assessment values as we do. For example, "excited", which means high arousal, consists of surprise, fear, anger, and anxiety. Thus we cannot compare with [51]. For the DEAP database, the samples corresponding to two videos (experiment IDs 17 and 18) could not be downloaded from YouTube due to copyright issues. Thus, we cannot compare with [7] either in this article.

V. CONCLUSION

In this article, we propose novel multimodal methods that exploit privileged information to improve both emotion recognition and video tagging. Specifically, we first perform a CCA analysis to construct new feature spaces for the EEG and video features by exploiting their relationships. Based on these new feature spaces, we then employ the stimulus videos as privileged information for EEG-based emotion recognition and employ EEG signals as privileged information for video emotion tagging.
The experimental results on three benchmark databases show that by implicitly using one modality as privileged information, we can improve the performance on the other modality during testing. For EEG-based emotion recognition, our proposed approach using stimulus videos as privileged information outperforms the baseline method for both valence and arousal recognition. Specifically, for valence recognition, the stimulus videos improve both recognition accuracy and F1-score on all three databases. For arousal recognition, the stimulus videos improve the F1-score on all three databases and improve the recognition accuracy on one dataset. These experiments demonstrate that exploiting the stimulus context as privileged information during training benefits EEG-based emotion recognition.
For video emotion tagging, our proposed approach using EEG signals as privileged information outperforms the baseline method without the help of EEG signals in most cases, demonstrating that exploiting EEG signals during training can in general improve video emotion tagging. However, our experiments also reveal that, for videos that cause large variation among subjects in their EEG responses and in their emotion ratings, exploiting EEG signals as privileged information may actually hurt video emotion tagging.
Furthermore, the comparison with related work shows that our proposed implicit fusion achieves comparable or even better performance than the existing methods based on explicit fusion of the two modalities.

REFERENCES

[1] N. Fragopanagos and J. G. Taylor, "Emotion recognition in human-computer interaction," Neural Netw., vol. 18, no. 4, pp. 389–405, 2005, Emotion and Brain.

[2] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, "Toward practical smile detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 31, no. 11, pp. 2106–2111, 2009.

[3] J. E. LeDoux, "Cognitive-emotional interactions in the brain," Cogn. Emotion, vol. 3, no. 4, pp. 267–289, 1989.

[4] V. Vapnik and A. Vashist, “A new learning paradigm: Learning usingprivileged information,” Neural Netw., vol. 22, no. 5, pp. 544–557,2009.

[5] S. Wang and X. Wang, "Emotional semantic detection from multimedia: A brief overview," Kansei Eng. Soft Computing: Theory and Practice, pp. 126–146, 2010.


[6] S. Wang, Y. Zhu, G. Wu, and Q. Ji, "Hybrid video emotional tagging using users' EEG and video content," Multimedia Tools Applicat., pp. 1–27, 2013.

[7] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "DEAP: A database for emotion analysis using physiological signals," IEEE Trans. Affective Comput., vol. 3, no. 1, pp. 18–31, Jan.–Mar. 2012.

[8] M. E. Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recogn., vol. 44, no. 3, pp. 572–587, 2011.

[9] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Trans. Pattern Anal. Machine Intell., vol. 31, no. 1, pp. 39–58, 2009.

[10] H. Gunes, B. Schuller, M. Pantic, and R. Cowie, "Emotion representation, analysis and synthesis in continuous space: A survey," in Proc. IEEE Int. Conf. Automat. Face Gesture Recogn. Workshops (FG 2011), 2011, pp. 827–834.

[11] S. K. D’Mello and A. Graesser, “Multimodal semiautomated affectdetection from conversational cues, gross body language, and facialfeatures,” User Model. User-Adapted Interact., vol. 20, pp. 147–187,2010.

[12] G. Caridakis, K. Karpouzis, and S. Kollias, "User and context adaptive neural networks for emotion recognition," Neurocomputing, vol. 71, no. 13–15, pp. 2553–2562, 2008.

[13] G. Castellano, L. Kessous, and G. Caridakis, “Emotion recognitionthrough multiple modalities: Face, body gesture, speech,” in Affectand Emotion in Human-Computer Interaction, volume 4868 of Lec-ture Notes in Computer Science, C. Peter and R. Beale, Eds. Berlin,Germany: Springer, 2008, pp. 92–103.

[14] J. Kim, “Bimodal emotion recognition using speech and physiologicalchanges,” Robust Speech Recogn. Understand., pp. 265–280, 2007.

[15] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh,S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recogni-tion using facial expressions, speech and multimodal information,” inProc. 6th Int. Conf. Multimodal Interfaces, ICMI’04, New York, NY,USA, 2004, pp. 205–211, ACM.

[16] I. Arroyo, D. G. Cooper, W. Burleson, B. P. Woolf, K. Muldner, andR. Christopherson, “Emotion sensors go to school,” in Proc. 2009Conf. Artif. Intell. Education, Brighton, U.K., July 6–10, 2009, pp.17–24.

[17] A. Kapoor, W. Burleson, and R. W. Picard, "Automatic prediction of frustration," Int. J. Human-Computer Studies, vol. 65, no. 8, pp. 724–736, 2007.

[18] K. Karpouzis, G. Caridakis, L. Kessous, N. Amir, A. Raouzaiou, L.Malatesta, and S. Kollias, “Modeling naturalistic affective states viafacial, vocal, and bodily expressions recognition,” in Artifical Intelli-gence for Human Computing, volume 4451 of Lecture Notes in Com-puter Science, T. S. Huang, A. Nijholt, M. Pantic, and A. Pentland,Eds. Berlin, Germany: Springer, 2007, pp. 91–112.

[19] A. Kapoor and R. W. Picard, “Multimodal affect recognition inlearning environments,” in Proc. 13th Annu. ACM Int. Conf. Multi-media, MULTIMEDIA’05, New York, NY, USA, 2005, pp. 677–682,ACM.

[20] R. Kaliouby and P. Robinson, "Generalization of a vision-based computational model of mind-reading," in Affective Computing and Intelligent Interaction, volume 3784 of Lecture Notes in Computer Science, J. Tao, T. Tan, and R. W. Picard, Eds. Berlin, Germany: Springer, 2005, pp. 582–589.

[21] J. N. Bailenson, E. D. Pontikakis, I. B. Mauss, J. J. Gross, M. E. Jabon,C. A. C. Hutcherson, C. Nass, and O. John, “Real-time classificationof evoked emotions using facial feature tracking and physiological re-sponses,” Int. J. Human-Computer Studies, vol. 66, no. 5, pp. 303–317,2008.

[22] M. Mansoorizadeh and N. M. Charkari, “Multimodal information fu-sion application to human emotion recognition from face and speech,”Multimedia Tools Applicat., vol. 49, pp. 277–297, 2010.

[23] J. Kim and E. André, "Emotion recognition using physiological and speech signal in short-term observation," in Perception and Interactive Technologies, volume 4021 of Lecture Notes in Computer Science. Berlin, Germany: Springer, 2006, pp. 53–64.

[24] R. A. Calvo, I. Brown, and S. Scheding, “Effect of experimental fac-tors on the recognition of affective mental states through physiolog-ical measures,” in AI 2009: Advances in Artificial Intelligence, volume5866 of Lecture Notes in Computer Science, A. Nicholson and X. Li,Eds. Berlin, Germany: Springer, 2009, pp. 62–70.

[25] O. AlZoubi, R. A. Calvo, and R. H. Stevens, "Classification of EEG for affect recognition: An adaptive approach," in AI 2009: Advances in Artificial Intelligence, volume 5866 of Lecture Notes in Computer Science, A. Nicholson and X. Li, Eds. Berlin, Germany: Springer, 2009, pp. 52–61.

[26] C. Liu, K. Conn, N. Sarkar, and W. Stone, "Physiology-based affect recognition for computer-assisted intervention of children with autism spectrum disorder," Int. J. Human-Computer Studies, vol. 66, no. 9, pp. 662–677, 2008.

[27] A. Heraz and C. Frasson, "Predicting the three major dimensions of the learner's emotions from brainwaves," World Acad. Sci., Eng. Technol., vol. 25, pp. 323–329, 2007.

[28] O. Villon and C. Lisetti, "A user-modeling approach to build user's psycho-physiological maps of emotions using bio-sensors," in Proc. 15th IEEE Int. Symp. Robot Human Interact. Commun. (RO-MAN 2006), Sep. 2006, pp. 269–276.

[29] J. Wagner, J. Kim, and E. André, "From physiological signals to emotions: Implementing and comparing selected methods for feature extraction and classification," in Proc. IEEE Int. Conf. Multimedia Expo (ICME 2005), Jul. 2005, pp. 940–943.

[30] K. H. Kim, S. W. Bang, and S. R. Kim, "Emotion recognition system using short-term monitoring of physiological signals," Med. Biologic. Eng. Comput., vol. 42, pp. 419–427, 2004.

[31] F. Nasoz, K. Alvarez, C. L. Lisetti, and N. Finkelstein, "Emotion recognition from physiological signals using wireless sensors for presence technologies," Cognition, Technol. Work, vol. 6, pp. 4–14, 2004.

[32] A. Haag, S. Goronzy, P. Schaich, and J. Williams, "Emotion recognition using bio-sensors: First steps towards an automatic system," in Affective Dialogue Systems, volume 3068 of Lecture Notes in Computer Science. Berlin, Germany: Springer, 2004, pp. 36–48.

[33] T. Bänziger, D. Grandjean, and K. R. Scherer, "Emotion recognition from expressions in face, voice, and body: The multimodal emotion recognition test (MERT)," Emotion, vol. 9, no. 5, p. 691, 2009.

[34] Z.-J. Chuang and C.-H. Wu, "Multi-modal emotion recognition from speech and text," Computat. Linguistics Chinese Language Process., vol. 9, no. 2, pp. 45–62, 2004.

[35] M. Sazzad Hussain, R. A. Calvo, and P. A. Pour, "Hybrid fusion approach for detecting affects from multichannel physiology," in Affective Computing and Intelligent Interaction, volume 6974 of Lecture Notes in Computer Science, S. D'Mello, A. Graesser, B. Schuller, and J.-C. Martin, Eds. Berlin, Germany: Springer, 2011, pp. 568–577.

[36] R. A. Calvo and S. D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Trans. Affective Comput., vol. 1, no. 1, pp. 18–37, Jan. 2010.

[37] Y. Zhao, "Human emotion recognition from body language of the head using soft computing techniques," Ph.D. dissertation, University of Ottawa, Ottawa, ON, Canada, 2012.

[38] N. Sebe, I. Cohen, T. Gevers, and T. S. Huang, "Multimodal approaches for emotion recognition: A survey," Internet Imaging VI, vol. 5670, pp. 56–67, 2005.

[39] F. Nack, C. Dorai, and S. Venkatesh, "Computational media aesthetics: Finding meaning beautiful," IEEE MultiMedia, vol. 8, no. 4, pp. 10–12, 2001.

[40] A. Hanjalic and L. Q. Xu, "Affective video content representation and modeling," IEEE Trans. Multimedia, vol. 7, no. 1, pp. 143–154, 2005.

[41] S. Wang and Q. Ji, "Video affective content analysis: A survey of state-of-the-art methods," IEEE Trans. Affective Comput., vol. 99, pp. 1–1, 2015.

[42] M. Soleymani, "Implicit and automated emotional tagging of videos," Ph.D. dissertation, Univ. of Geneva, Geneva, Switzerland, Mar. 11, 2011, ID: unige:17629.

[43] Y. Zhu, S. Wang, and Q. Ji, "Emotion recognition from users' EEG signals with the help of stimulus videos," in Proc. 2014 IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2014, pp. 1–6.

[44] S. Koelstra, C. Mühl, and I. Patras, "EEG analysis for implicit tagging of video data," in Proc. 3rd Int. Conf. Affective Comput. Intelligent Interact. Workshops (ACII 2009), 2009, pp. 1–6, IEEE.

[45] S. Koelstra, A. Yazdani, M. Soleymani, C. Mühl, J.-S. Lee, A. Nijholt, T. Pun, T. Ebrahimi, and I. Patras, "Single trial classification of EEG and peripheral physiological signals for recognition of emotions induced by music videos," in Brain Informatics, volume 6334 of Lecture Notes in Computer Science. Berlin, Germany: Springer, 2010, pp. 89–100.

[46] Z. Ji and S. Qin, "Detection of EEG basic rhythm feature by using band relative intensity ratio (BRIR)," in Proc. IEEE 2003 Int. Conf. Acoust., Speech, Signal Processing (ICASSP'03), 2003, vol. 6, pp. VI–429.

[47] H. L. Wang and L. F. Cheong, "Affective understanding in film," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 689–704, 2006.


[48] S. Molau, M. Pitz, R. Schlüter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Proc. IEEE 2001 Int. Conf. Acoust., Speech, Signal Processing (ICASSP'01), 2001, vol. 1, pp. 73–76.

[49] P. Boersma, "Praat, a system for doing phonetics by computer," Glot Int., vol. 5, no. 9/10, pp. 341–345, 2002.

[50] D. Weenink, "Canonical correlation analysis," in Proc. IFA, 2003, vol. 25, pp. 81–99.

[51] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, "A multimodal database for affect recognition and implicit tagging," IEEE Trans. Affective Comput., vol. 3, no. 1, pp. 42–55, 2012.

[52] G. H. Klem, H. Lüders, H. Jasper, and C. Elger, "The ten-twenty electrode system of the International Federation. The International Federation of Clinical Neurophysiology," Electroencephalogr. Clin. Neurophys., vol. 52, p. 3, 1999, Supplement.

[53] P. Krolak-Salmon, M. A. Hénaff, A. Vighetto, O. Bertrand, and F. Mauguière, "Early amygdala reaction to fear spreading in occipital, temporal, and frontal cortex: A depth electrode ERP study in human," Neuron, vol. 42, no. 4, pp. 665–676, 2004.

[54] M. Xu, J. S. Jin, S. Luo, and L. Duan, "Hierarchical movie affective content analysis based on arousal and valence features," in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 677–680.

Shangfei Wang (M'02) received the B.S. degree in electronic engineering from Anhui University, Hefei, Anhui, China, in 1996, and the M.S. degree in circuits and systems and the Ph.D. degree in signal and information processing from the University of Science and Technology of China (USTC), Hefei, Anhui, China, in 1999 and 2002, respectively. From 2004 to 2005, she was a postdoctoral research fellow at Kyushu University, Japan. Between 2011 and 2012, she was a visiting scholar at Rensselaer Polytechnic Institute, Troy, NY, USA.

She is currently an Associate Professor in the School of Computer Science and Technology, USTC. Her research interests cover computational intelligence, affective computing, and probabilistic graphical models. She has authored or co-authored over 70 publications.

Dr. Wang is a member of the IEEE and the ACM.

Yachen Zhu received the B.S. degree in computer science from the University of Science and Technology of China (USTC), Hefei, China, in 2010. He is currently pursuing the Ph.D. degree in computer science at USTC, Hefei, China. His research interest is affective computing.

Lihua Yue received the B.S. degree in computational mathematics from the University of Science and Technology of China (USTC), Hefei, China, in 1975, and the M.S. degree in software engineering from USTC, Hefei, China.

She is an expert in databases and has led the development of database administration systems. She is currently a Professor and Ph.D. supervisor at USTC. Her research areas include database systems and applications, information integration, and real-time databases.

Qiang Ji (F'15) received the Ph.D. degree in electrical engineering from the University of Washington, Seattle, WA, USA.

He is currently a Professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI), Troy, NY, USA. He recently served as a program director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, USA, the Robotics Institute at Carnegie Mellon University, Pittsburgh, PA, USA, the Department of Computer Science at the University of Nevada, Reno, USA, and the U.S. Air Force Research Laboratory. Prof. Ji currently serves as the director of the Intelligent Systems Laboratory (ISL) at RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published over 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies including Honda and Boeing.

Prof. Ji is an editor on several related IEEE and international journals, and he has served as a general chair, program chair, technical area chair, and program committee member in numerous international conferences and workshops. Prof. Ji is a Fellow of the IEEE and IAPR.