TRANSCRIPT
M. Brendel¹, R. Zaccarelli¹, L. Devillers¹,²
¹LIMSI-CNRS, ²Paris-South University
French National Research Agency - Affective Avatar project (2007-2010)
Example application of the speech emotion detection system: controlling an affective avatar in
Skype (the speaker is depicted by his/her avatar).
The avatar should show the facial and gestural expressive behavior corresponding to the emotion detected in the voice, with synthesized speech synchronized to the lip movements.
The mapping between the output of the emotion detection system and the expressive avatar was done with the ECA team at LIMSI and the other partners.
This application poses two main challenges for emotion detection: speaker-independent emotion detection and real-time emotion detection.
We focus on emotion detection for 4 macro-classes:
Anger (Annoyance, Hot anger, Impatience)
Sadness (Disappointment, Sadness)
Positive (Amusement, Joy, Satisfaction)
Neutral
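The macro-class grouping above can be expressed as a simple lookup table. This is a hypothetical sketch (the label strings and function name are illustrative, not from the original annotation scheme):

```python
# Hypothetical mapping from fine-grained annotations to the 4 macro-classes.
MACRO_CLASSES = {
    "Annoyance": "Anger", "Hot anger": "Anger", "Impatience": "Anger",
    "Disappointment": "Sadness", "Sadness": "Sadness",
    "Amusement": "Positive", "Joy": "Positive", "Satisfaction": "Positive",
    "Neutral": "Neutral",
}

def to_macro(label: str) -> str:
    """Collapse a fine-grained emotion label into its macro-class."""
    return MACRO_CLASSES[label]
```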
The choice of appropriate corpora for training models is fundamental. Data must be as close as possible to the behaviors observed in the real application, but sometimes such an application does not yet exist. The corpus must be large enough, with a large number of speakers and sufficient variability of emotional expressions, including complex, mixed and shaded emotions.
Available corpora in the community are mainly acted and small, with few speakers and little variation in the expression of emotions, and without any application in sight.
LIMSI corpora were mainly collected in call centers (bank, emergency or stock exchange call centers), with many negative emotions.
CINEMO (Rollet et al., 2009; Schuller et al., 2010)
contains acted emotional expressions (mainly everyday situations) obtained through dubbing exercises ("Cine Karaoké") played by 50 speakers - manually segmented - 2 coders (the annotation scheme allows annotating mixtures of emotions) - many shaded and mixed emotions
JEMO
obtained with an emotion detection game played by 39 speakers - first prototype of a real-time detection system - automatically segmented - 2 coders - prototypical emotions: very few mixtures of emotions were annotated
Question of this paper: can we mix different kinds of corpora recorded in the same conditions to train a more efficient classifier?
Sub-corpus CINEMO (50 speakers)
            POS  SAD  ANG  NEU  TOTAL
#segments   313  364  344  510  1012
Sub-corpora of consensual segments were chosen for training the 4-class detection models; mixtures of emotions were not considered.
Sub-corpus JEMO (38 different speakers)
            POS  SAD  ANG  NEU  TOTAL
#segments   316  223  179  416  1062
We compared Corpus-Anger with Corpus-All on a set of acoustic features, plotting for each feature, across the three corpora, the relative distance

Distance = (Anger mean − All mean) / All mean
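The per-feature distance defined above can be computed in a few lines. A minimal sketch with NumPy, assuming each sub-corpus is a matrix with one row per segment and one column per acoustic feature (the function name is illustrative):

```python
import numpy as np

def anger_distance(features_anger: np.ndarray, features_all: np.ndarray) -> np.ndarray:
    """Relative deviation of the Anger sub-corpus mean from the
    whole-corpus mean, computed per acoustic feature (columns)."""
    anger_mean = features_anger.mean(axis=0)
    all_mean = features_all.mean(axis=0)
    return (anger_mean - all_mean) / all_mean
```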
Features: 1-5 rolloff (5%, 25%, 50%, 75%, 95%); 6 centroid; 7 spectral slope; 8 spectral origin; 9-13 band energies (0-250, 250-650, 0-650, 650-1k, 1k-4k); 14-37 bark bands 1-24; 38-50 MFCC 0-12; 51 ZCR; 52 mean loudness; 53 RMS intensity ×1000; 54 max/min F0 ratio; 55 var F0; 56 F2-F1; 57 F3-F2; 58-60 var F1, F2, F3; 61 voiced ratio; 62 local jitter; 63 local shimmer; 64 HNR
(Tahon & Devillers, Speech Prosody 2010)
Computed on voiced segments
Low-Level Descriptors (number of features computed with functionals): Energy (29), RMS Energy (22), F0 (23), Zero-Crossing Rate (18), MFCC 1-16 (366)
Functionals: moments (2); absolute mean, max; extremes (2): 2 x values, range; linear regression (2): MAE/MSE, slope; quartiles (2): quartile, interquartile
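Applying functionals means turning a variable-length frame-level LLD contour (e.g. an F0 or MFCC track) into a fixed-size vector of segment-level statistics. A minimal NumPy sketch covering the functional families listed above (the function name and exact statistics are illustrative, not the paper's exact feature set):

```python
import numpy as np

def functionals(contour: np.ndarray) -> dict:
    """Summarize a frame-level LLD contour with moments, extremes,
    quartiles, and linear-regression functionals."""
    t = np.arange(len(contour))
    slope, intercept = np.polyfit(t, contour, 1)       # linear regression
    pred = slope * t + intercept
    return {
        "mean": contour.mean(), "std": contour.std(),  # moments
        "min": contour.min(), "max": contour.max(),    # extremes
        "range": contour.max() - contour.min(),
        "q1": np.percentile(contour, 25),              # quartiles
        "q3": np.percentile(contour, 75),
        "slope": slope,                                # regression slope
        "mse": np.mean((contour - pred) ** 2),         # regression error
    }
```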
RR/UAR         Test CINEMO   Test JEMO
Train CINEMO   0.50/0.48     0.51/0.48
Train JEMO     0.43/0.39     0.60/0.55
• Training on CINEMO and testing on JEMO performs better than vice versa.
• It seems better to train on a wider set (CINEMO has more variability of emotional expressions, in different contexts) and test on a narrower one (JEMO contains more prototypical emotions) than the other way around.
• Surprisingly, training on CINEMO and testing on JEMO gives slightly better performance than testing on CINEMO itself.
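The two scores reported throughout are RR (recognition rate, i.e. overall accuracy) and UAR (unweighted average recall, the mean of per-class recalls, which is insensitive to class imbalance). They can be computed as follows (the function name is illustrative):

```python
import numpy as np

def rr_uar(y_true, y_pred):
    """Recognition rate (accuracy) and unweighted average recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rr = float(np.mean(y_true == y_pred))
    # Per-class recall, averaged with equal weight per class
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return rr, float(np.mean(recalls))
```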
RR/UAR         SpD CV (WEKA)   SpI CV
CINEMO         0.57/0.56       0.50/0.48
JEMO           0.63/0.59       0.60/0.55
CINEMO+JEMO    0.58/0.56       0.54/0.51

Be careful: WEKA uses speaker-dependent (SpD) cross-validation!
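The SpD/SpI distinction matters because plain k-fold cross-validation can place segments from the same speaker in both the training and the test fold, inflating scores. A speaker-independent split can be sketched with scikit-learn's GroupKFold (the data here is a toy stand-in for the acoustic feature matrix):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((12, 4))                 # 12 segments, 4 features (toy data)
y = rng.integers(0, 4, size=12)         # 4 emotion classes
speakers = np.repeat([0, 1, 2, 3], 3)   # 3 segments per speaker

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=speakers):
    # No speaker appears on both sides of a speaker-independent split
    assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])
```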
This means that the unification of the corpora improved the results.
We could not do better than JEMO on itself, but that good result is clearly due to JEMO being a small corpus with prototypical emotions only; it has poor generalization power (training on JEMO, testing on CINEMO: 0.43/0.39).
After balancing tests, we can also conclude that the performance improvement is mainly due to the larger number of instances.
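A balancing test of the kind mentioned above can be approximated by downsampling every class to the size of the smallest one, so that comparisons are not driven by class priors or sheer instance count. A minimal sketch (function name and data layout are illustrative):

```python
import random

def downsample(segments_by_class: dict, seed: int = 0) -> dict:
    """Downsample every class to the size of the smallest class,
    yielding a balanced subset for controlled comparisons."""
    rng = random.Random(seed)
    n = min(len(segs) for segs in segments_by_class.values())
    return {cls: rng.sample(segs, n) for cls, segs in segments_by_class.items()}
```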
RR/UAR   SpI CV, all features   SpI CV, SFFS
Female   0.59/0.55              0.65 (31 features)
Male     0.52/0.49              0.55 (38 features)
All      0.54/0.51
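SFFS here refers to sequential floating forward selection; a plain (non-floating) forward selection, which greedily adds the feature that most improves cross-validated performance, can be sketched with scikit-learn's SequentialFeatureSelector on toy data standing in for the acoustic feature matrix:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Toy data: 80 segments, 10 features; only features 0 and 3 carry the label
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Greedy forward selection down to 2 features, scored by 3-fold CV
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=3
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```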
          Positive   Sadness   Anger   Neutral
Male      252        262       267     432
Female    377        325       256     494
Unification of the two corpora (88 speakers) improves the results: the number of instances is approximately doubled, the classes are more balanced, and the two corpora enrich each other.
Splitting the corpus along gender is also beneficial: the models trained on the gender sub-corpora are better, and gender information was available in our affective avatar application.
Feature selection also seems beneficial (cross-corpora studies are needed).
Emotional databases are often small, sparse resources when collected in natural contexts (often less than 10% of utterances are emotional), making it difficult to build generic models from one corpus.
Find measures for qualifying emotional databases
Cross-corpora studies are very important
Use multiple corpora collected in different contexts to train models
Thanks for your attention