
Speech Emotion Analysis in Noisy Real-World Environment

Ashish Tawari and Mohan Trivedi
University of California San Diego, Dept. of ECE

[email protected], [email protected]

Abstract

Automatic recognition of emotional states via the speech signal has attracted increasing attention in recent years. A number of techniques have been proposed that are capable of providing reasonably high accuracy in controlled studio settings. However, their performance is considerably degraded when the speech signal is contaminated by noise. In this paper, we present a framework with adaptive noise cancellation as a front end to a speech emotion recognizer. We also introduce a new feature set based on cepstral analysis of pitch and energy contours. Experimental analysis shows promising results.

1. Introduction

Speech signals convey not only words and meanings but also emotions. Besides human facial expressions, speech has proven to be another promising modality for the recognition of human emotions. Most of the existing approaches to automatic affect analysis aim at recognizing a small number of prototypical emotions (i.e., happiness, sadness, anger, fear, surprise, disgust and neutral) on acted speech. The fact that acted behavior differs in audio profile and timing from spontaneous behavior has led research to shift towards the analysis of spontaneous human behavior in naturalistic settings. A good overview of recent advances towards spontaneous behavior analysis is presented in [12]. With the research shift toward spontaneous behavior, many challenges have come to the surface, ranging from database collection strategies to the use of new feature sets (e.g., lexical cues [3] in addition to prosodic features) and contextual information [10]. While automatic detection of the six basic emotions in a controlled environment can be done with reasonably high accuracy, detecting these emotions, or any emotion, in less constrained settings is still a very challenging task. In this paper, we investigate the effects of acoustic conditions on the recognition of emotion using speech.

2 Speech Emotion Recognition in Noisy Conditions: Overall Framework

We present a framework with adaptive noise cancellation as a front end to a speech emotion recognizer. We also introduce a new feature set based on cepstral analysis of pitch and energy contours. We present our study in the context of driver assistance systems. Today's cars are fitted with various interactive systems such as satellite navigation, audio/video players, and hands-free mobile telephony. These systems are often enhanced by speech recognition, as speech-based interactions are less distracting to the driver than interaction with a visual display [7]. Therefore, to improve both comfort and safety in the car, driver assistance technologies need an effective speech interface between the driver and the infotainment system. Such a real-world environment, however, poses robustness issues, as the system may be highly susceptible to noise [6].

To reflect reality adequately, several noise scenarios of approximately 60 seconds each were recorded in the car while driving. Interior noise depends on several factors such as vehicle type, road surface and outside environment. In this study, an instrumented Infiniti Q45 on-road testbed is used to record the interior noise in the following scenarios: highway (FWY), parking lot (PRK) and city street (CST). We also explore the performance for white Gaussian noise (WGN).
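These noise recordings are later added to clean utterances at controlled SNR levels (see Section 6). For illustration, a minimal NumPy sketch of such mixing is given below; the function name and interface are our own and not part of the original implementation.

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add.

        Both inputs are 1-D float arrays at the same sampling rate; the noise
        recording is tiled/truncated to the length of the clean utterance.
        """
        # Repeat the noise recording if it is shorter than the utterance.
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)[: len(clean)]

        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        # Gain that yields the requested signal-to-noise ratio.
        gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
        return clean + gain * noise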

Figure 1 provides an overview of the overall system. The input speech signal is processed by the speech enhancement block. The enhanced speech is then processed for further analysis. The classifiers are trained using clean speech signals. In the following sections, we provide details of each block.

3 Adaptive Speech Enhancement

During the last few decades, several speech enhancement techniques have been proposed, ranging from beamforming with microphone arrays to adaptive noise filtering approaches.


Figure 1. Speech emotion recognition overview

In this work, we utilize a speech enhancement technique based on adaptive thresholding in the wavelet domain [5]. The objective criterion used for parameter selection in our experiments, however, was classification performance rather than SNR. In particular, we chose a 3-level wavelet packet transform. The smoothing factor (α) for the noise update stage is set to 0.4, while the time constant (τ) for the 'adaptive node-dependent threshold update' stage and the exponent factor (γ) of the thresholding function are tuned for best classification performance, yielding values of 2 and 3, respectively. The voice activity detector required by the speech enhancement technique is based on adaptive thresholding of signal subband energy, similar to [9].
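As a rough illustration of wavelet-packet thresholding, the PyWavelets sketch below soft-thresholds each terminal node of a 3-level decomposition. The universal threshold per node is a simplified stand-in for the adaptive, node-dependent rule of [5], and the wavelet choice and threshold rule are illustrative assumptions, not the exact configuration used here.

    import numpy as np
    import pywt

    def wavelet_denoise(frame, wavelet="db4", level=3):
        """Denoise one speech frame with a 3-level wavelet packet transform.

        Each terminal node is soft-thresholded with a universal threshold
        estimated from that node's coefficients (a simplified, non-adaptive
        stand-in for the node-dependent update of [5]).
        """
        wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                                mode="symmetric", maxlevel=level)
        for node in wp.get_level(level, order="natural"):
            coeffs = node.data
            # Robust noise estimate from the median absolute deviation.
            sigma = np.median(np.abs(coeffs)) / 0.6745
            thr = sigma * np.sqrt(2.0 * np.log(max(len(coeffs), 2)))
            node.data = pywt.threshold(coeffs, thr, mode="soft")
        return wp.reconstruct(update=False)[: len(frame)]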

4 Extracting Meaningful Information from Speech Signals

4.1 Feature extraction

A variety of features have been proposed to recognize emotional states from the speech signal. These features can be categorized as acoustic features and linguistic features. We avoid using the latter, since they require robust speech recognition in the first place and are also a drawback for multi-language emotion recognition. Hence we only consider acoustic features. Within the acoustic category, we focus on prosodic features such as speech intensity, pitch and speaking rate, and spectral features such as mel frequency cepstral coefficients (MFCC) to model emotional states. For pitch calculation, we used an auto-correlation algorithm similar to [8]. Speech intensity is represented by log-energy coefficients calculated using 30 ms frames with a shift interval of 10 ms.
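For concreteness, the sketch below computes framewise log-energy and a basic autocorrelation pitch estimate over 30 ms frames with a 10 ms shift. It is a plain NumPy illustration in the spirit of [8]; the function names, voicing threshold and pitch range are our assumptions, not the exact extractor used in the experiments.

    import numpy as np

    def frame_signal(x, sr, frame_ms=30, shift_ms=10):
        """Split a 1-D signal into overlapping frames."""
        flen, hop = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
        n = 1 + max(0, (len(x) - flen) // hop)
        return np.stack([x[i * hop: i * hop + flen] for i in range(n)])

    def log_energy(frames, eps=1e-10):
        """Log-energy coefficient per frame."""
        return np.log(np.sum(frames ** 2, axis=1) + eps)

    def autocorr_pitch(frame, sr, fmin=75, fmax=400):
        """Crude autocorrelation pitch estimate for one frame (0 if unvoiced)."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        if hi >= len(ac) or ac[0] <= 0:
            return 0.0
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Simple voicing check: normalized autocorrelation peak.
        return sr / lag if ac[lag] / ac[0] > 0.3 else 0.0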

The value of a framewise parameter extracted from a few milliseconds of audio is of little significance for determining an emotional state. It is, rather, the time trend of these features that is of interest. In order to capture the characteristics of the contours, we perform cepstrum analysis over each contour. For this, we first interpolate the contour to obtain samples separated by the sampling period of the speech signal, which are then used to calculate the cepstrum coefficients as follows:

c(m) = \frac{1}{N} \sum_{k=0}^{N-1} \log |X(k)| \, e^{j 2\pi k m / N}, \qquad m = 0, 1, \ldots, N-1

where X(k) denotes the N-point discrete Fourier transform of the windowed signal x(n). Cepstrum analysis is a source-filter separation process commonly used in speech processing. Cepstrum coefficients c(0) to c(13) and their time derivatives (first and second order), calculated from 480 samples, are utilized to capture the spectral characteristics of the contours. For pitch contour analysis, only the voiced portions are utilized. The other features that we utilize are 13 MFCCs, C1 to C13, with their delta and acceleration components. The input signal is processed using a 30 ms Hamming window with a frame shift interval of 10 ms.
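A minimal sketch of the contour cepstrum described above is given next: the framewise contour is interpolated to the speech sampling period, windowed, and its real cepstrum is taken as the inverse DFT of the log magnitude spectrum, matching the formula. The linear interpolation and the Hamming window over 480 samples are our illustrative choices.

    import numpy as np

    def contour_cepstrum(contour, sr, n_coeff=14, n_fft=480):
        """Cepstral coefficients c(0)..c(n_coeff-1) of a pitch or energy contour.

        The contour (one value per 10 ms frame) is linearly interpolated to the
        speech sampling period, windowed, and its real cepstrum is computed as
        the inverse DFT of the log magnitude spectrum.
        """
        # Interpolate framewise values (10 ms apart) to one sample per 1/sr s.
        t_frames = np.arange(len(contour)) * 0.010
        t_samples = np.arange(0, t_frames[-1], 1.0 / sr)
        x = np.interp(t_samples, t_frames, contour)

        x = x[:n_fft] * np.hamming(min(len(x), n_fft))
        spec = np.fft.fft(x, n=n_fft)
        ceps = np.fft.ifft(np.log(np.abs(spec) + 1e-10)).real
        return ceps[:n_coeff]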

For all these sequences, the following statistical information is calculated: mean, standard deviation, relative maximum/minimum, position of the relative maximum/minimum, 1st quartile, 2nd quartile (median) and 3rd quartile. Speaking rate is modeled as the fraction of voiced segments. Thus, the total feature vector per segment contains 3·(13+13+13)·9+1 = 1054 attributes.
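The nine statistical functionals per feature sequence can be computed roughly as below; the helper name and the normalization of the extremum positions are our assumptions for illustration.

    import numpy as np

    def functionals(seq):
        """Nine statistical functionals of one framewise feature sequence."""
        seq = np.asarray(seq, dtype=float)
        q1, q2, q3 = np.percentile(seq, [25, 50, 75])
        return np.array([
            seq.mean(),                 # mean
            seq.std(),                  # standard deviation
            seq.max(),                  # relative maximum
            seq.min(),                  # relative minimum
            np.argmax(seq) / len(seq),  # position of maximum (normalized)
            np.argmin(seq) / len(seq),  # position of minimum (normalized)
            q1, q2, q3,                 # 1st, 2nd (median) and 3rd quartiles
        ])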

4.2 Feature Selection and Normalization

Intuitively, a large number of features should improve classification performance; in practice, however, a large feature space suffers from the 'curse of dimensionality'. Therefore, in order to improve the classification performance, a feature selection technique is utilized. One such method to eliminate irrelevant and redundant features is to identify features with high correlation with the class but low correlation among themselves. We used the CfsSubsetEval feature selection technique with a best-first search strategy, as provided by WEKA [4]. We used a stratified 10-fold selection procedure, where the database is divided into 10 folds so that they contain approximately the same proportions of labels as the original database, and nine folds are used for feature selection. This is repeated 10 times, each time leaving out a different fold. Thus, we obtain 10 different sets of selected attributes. A score ranging from 0 to 10 is then assigned to each attribute based on the number of times it has been selected. We group the attributes with score at least n and call them the 'n-10' aggregate. Finally, we choose the group that provides the best recognition accuracy.



Figure 2. LISA test bed for data acquisition

The '2-10' aggregate provided the best results on the two databases used in our experiments.
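The fold-wise scoring and 'n-10' aggregation can be expressed compactly as in the sketch below. The CFS step itself is run in WEKA, so the per-fold selections are represented here simply as lists of attribute indices; the function name and input format are hypothetical.

    from collections import Counter

    def n_10_aggregates(fold_selections, n_attributes):
        """Score attributes by how many of the 10 folds selected them and
        return the 'n-10' aggregate (indices selected at least n times)
        for every n from 1 to 10.

        `fold_selections` is a list of 10 iterables of attribute indices,
        one per fold (e.g. exported from WEKA's CfsSubsetEval runs).
        """
        score = Counter()
        for selected in fold_selections:
            score.update(set(selected))
        return {n: sorted(i for i in range(n_attributes) if score[i] >= n)
                for n in range(1, 11)}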

Feature normalization is a common technique to provide more appropriate attributes to the learning scheme used. In this work, we use the z-score technique, which transforms an original attribute v into a new attribute v̂ as

\hat{v} = \frac{v - v_{\mathrm{mean}}}{v_{\mathrm{std}}}

where v_mean and v_std are the mean and standard deviation of v, respectively.
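In matrix form, this amounts to subtracting per-attribute means and dividing by per-attribute standard deviations, as in the small sketch below (the epsilon guard against constant attributes is our addition). In practice the statistics should be estimated from the training folds only.

    import numpy as np

    def zscore(X, eps=1e-12):
        """Column-wise z-score normalization of a (samples x attributes) matrix."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / (std + eps)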

5 Databases

In this study, we have utilized two databases: the Berlin Database of Emotional Speech (EMO-DB) [2] and a locally collected audio-visual database of affective expression in a car (LISA-AVDB).

The first database provides a more controlled environment, allowing us to exclude other noise influences. The second database, on the other hand, presents a more realistic scenario.

5.1 Berlin Database of Emotional Speech

The studio-recorded Berlin database comprises six basic emotions (anger, boredom, disgust, anxiety, happiness and sadness) as well as neutral speech. Ten professional German actors (5 male and 5 female) spoke ten sentences with emotionally neutral content in the seven different emotions. 494 phrases were recognized with better than 80% accuracy and judged as natural by more than 60% of the listeners. An accuracy of 84.3% is reported for a human perception test.

Table 1. Confusion table using clean speech for the EMO-DB database with 84.01% overall recognition rate

Reference   Recognized Emotion (%)
Emotion     fea    dis    joy    bor    neu    sad    ang
fea         70.9   0      20.0   0      3.6    3.6    1.8
dis         0      89.5   0      5.3    2.6    2.6    0
joy         14.0   0      54.7   0      1.6    0      29.7
bor         0      1.3    0      93.7   2.5    2.5    0
neu         0      0      0      6.4    91.0   2.6    0
sad         1.9    0      0      1.9    0      96.2   0
ang         0.8    0.8    5.5    0      0.8    0      92.1

Table 2. Confusion table using clean speech for the LISA-AVDB database with 88.1% overall recognition rate

Reference   Recognized Emotion (%)
Emotion     pos    neu    neg
pos         92.7   7.3    0
neu         9.7    80.7   9.6
neg         1.7    8.3    90.0

The distribution of the phrases over the different emotions is: 55 frightened (fea), 38 disgusted (dis), 64 happy (joy), 79 bored (bor), 78 neutral (neu), 53 sad (sad) and 127 angry (ang).

5.2 LISA audio-visual affect database

The audio-visual database is collected using an analog video camera facing the driver and a directional microphone beneath the steering wheel. Figure 2 shows the placement of the camera and microphone. Video frames were acquired at approximately 30 frames per second, and the captured audio signal is resampled to a 16 kHz sampling rate. The database is collected in stationary and moving car environments. In this study, we analyze emotional speech data from the stationary car setting, which gives the effect of the car cockpit with a relatively high SNR value [11]. The emotional speech has been labeled into 3 groups, 'pos', 'neg' and 'neu', for positive, negative and neutral expressions. The data has been acquired from 4 different subjects: 2 male and 2 female. The distribution of data across categories is: 82 pos, 82 neg and 60 neu.

6 Experimental Analysis and Validation

All results were calculated from 10-fold cross-validation. For classification, we use Support Vector Machines (SVM) with a linear kernel and 1-vs-1 multiclass discrimination.
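A minimal scikit-learn sketch of this evaluation protocol (linear-kernel SVM, 1-vs-1 multiclass, stratified 10-fold cross-validation) is shown below; it is an illustration of the setup, not the toolchain used for the reported results, and the random seed and in-fold scaling are our assumptions.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def evaluate(X, y):
        """Overall 10-fold cross-validated accuracy with a linear SVM.

        SVC applies 1-vs-1 decision making for multiclass problems by default;
        z-score normalization is fit inside each training fold.
        """
        clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        return scores.mean()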



Table 3. Emotion recognition performance using different noise models at three SNR levels for the EMO-DB database

Overall recognition accuracy (%) for noisy and enhanced speech

              Signal-to-Noise Ratio (dB)
              15 dB         10 dB         5 dB
Test Cases    NS     ES     NS     ES     NS     ES
WGN           31.2   55.0   28.0   44.9   17.5   37.5
FWY           37.0   60.3   24.5   53.8   17.6   39.8
PRK           35.0   63.0   23.5   51.9   16.4   39.2
CST           35.0   61.2   24.7   54.3   16.8   39.0

NS: Noisy Speech; ES: Enhanced Speech

Table 4. Emotion recognition performance using different noise models at three SNR levels for the LISA-AVDB database

Overall recognition accuracy (%) for noisy and enhanced speech

              Signal-to-Noise Ratio (dB)
              15 dB         10 dB         5 dB
Test Cases    NS     ES     NS     ES     NS     ES
WGN           81.7   82.1   73.6   78.5   56.6   75.4
FWY           75.8   76.8   75.0   76.8   55.3   62.5
PRK           78.1   77.57  71.5   73.2   53.6   65.1
CST           78.5   78.1   71.0   75.8   56.7   69.1

NS: Noisy Speech; ES: Enhanced Speech

Tables 1 and 2 provide the performance measurements on clean speech for EMO-DB and LISA-AVDB, respectively. The overall recognition accuracy obtained is over 84% for EMO-DB and over 87% for LISA-AVDB. These results are comparable to other published approaches; for example, [1] achieved a recognition accuracy of 83.8% on EMO-DB using a fusion of GMM- and HMM-based classifiers. The baseline results show the usefulness of the feature set and classification technique. The different noise models explained in Section 2 are then added to the clean speech at three SNR levels: 15 dB, 10 dB and 5 dB. The overall recognition accuracies with and without front-end processing (speech enhancement) are reported in Tables 3 and 4 for the two databases. It can be seen that the noise cancellation greatly improves the performance. For the EMO-DB database, the seven-class classification problem at 5 dB SNR is only just above chance level for noisy speech; for enhanced speech, however, the performance is boosted by over 100% across the different noise models. The high recognition accuracy on noisy speech at higher SNR for LISA-AVDB can be attributed to the relatively matched training and testing environments, as the database is collected in a car setting.

7 Concluding Remarks

Recognition of emotions in the speech signal is highly susceptible to the acoustic environment. We presented our analysis on speech signals collected in a controlled environment (EMO-DB) as well as in a less constrained setting (LISA-AVDB). The results clearly show that the proposed framework greatly improves the performance.

Acknowledgments

The authors gratefully acknowledge the reviewers' comments. We are thankful to our colleagues at the CVRR lab for useful discussions and assistance.

References

[1] E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, M. Ozkan, and A. Tekalp. Speech-driven automatic facial expression synthesis. In 3DTV Conf: The True Vision - Capture, Transmission and Display of 3D Video, pages 273–276, May 2008.

[2] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional speech. In Proc. Interspeech, pages 1517–1520, 2005.

[3] L. Devillers and L. Vidrascu. Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In ICSLP, 2006.

[4] S. R. Garner. WEKA: The Waikato environment for knowledge analysis. In Proc. New Zealand Computer Science Research Students Conf, pages 57–64, 1995.

[5] Y. Ghanbari and M. R. Karami-Mollaei. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets. Speech Communication, 48(8):927–940, 2006.

[6] M. Grimm et al. On the necessity and feasibility of detecting a driver's emotional state while driving. In ACII, pages 126–138, 2007.

[7] H. Lunenfeld. Human factor considerations of motorist navigation and information systems. In Vehicle Navigation and Information Systems Conference, 1989.

[8] B. Paul. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Inst. of Phonetic Sciences 17, pages 97–110, 1993.

[9] R. V. Prasad, A. Sangwan, H. S. Jamadagni, C. M. C, R. Sah, and V. Gaurav. Comparison of voice activity detection algorithms for VoIP. In ISCC, page 530, 2002.

[10] A. Tawari and M. M. Trivedi. Context analysis in speech emotion recognition. IEEE Transactions on Multimedia, 2010.

[11] A. Tawari and M. M. Trivedi. Contextual framework for speech based emotion recognition in driver assistance system. In IEEE Intelligent Vehicles Symp, June 2010.

[12] Z. Zeng, M. Pantic, G. Roisman, and T. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. PAMI, 31(1):39–58, Jan. 2009.
