TRANSCRIPT
M. Brendel¹, R. Zaccarelli¹, L. Devillers¹,²
¹LIMSI-CNRS, ²Paris-South University
French National Research Agency - Affective Avatar project (2007-2010)
Example application of the speech emotion detection system: controlling an affective avatar in
Skype (the speaker is depicted by his/her avatar).
The avatar should show the facial and gestural expressive behavior corresponding to the emotion detected in the voice, with synthesized speech synchronized to the lip movements.
The mapping between the output of the emotion detection system and the expressive avatar was done with the ECA team at LIMSI and the other partners.
This application poses two main challenges for emotion detection: speaker-independent emotion detection and real-time emotion detection.
We focus on emotion detection for 4 macro-classes:
Anger (Annoyance, Hot anger, Impatience)
Sadness (Disappointment, Sadness)
Positive (Amusement, Joy, Satisfaction)
Neutral
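The macro-class grouping above can be expressed as a simple lookup table. This is a hypothetical sketch (the label strings and function name are illustrative, not from the original annotation scheme):

```python
# Hypothetical mapping from fine-grained annotations to the 4 macro-classes.
MACRO_CLASSES = {
    "Annoyance": "Anger", "Hot anger": "Anger", "Impatience": "Anger",
    "Disappointment": "Sadness", "Sadness": "Sadness",
    "Amusement": "Positive", "Joy": "Positive", "Satisfaction": "Positive",
    "Neutral": "Neutral",
}

def to_macro(label: str) -> str:
    """Collapse a fine-grained emotion label into its macro-class."""
    return MACRO_CLASSES[label]
```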
The choice of appropriate corpora for training models is fundamental. Data must be as close as possible to the behaviors observed in the real application, but sometimes such an application does not yet exist. The corpus must be large enough, with a large number of speakers and sufficient variability of emotional expressions, including complex, mixed and shaded emotions.
Available corpora in the community are mainly acted and small, with few speakers and little variation in the expression of emotions, and without any application in sight.
LIMSI corpora were mainly collected in call centers (bank, emergency or stock exchange call centers), with many negative emotions.
CINEMO (Rollet et al., 2009; Schuller et al., 2010)
contains acted emotional expressions (mainly everyday situations) obtained through dubbing exercises ("Cine Karaoké") played by 50 speakers - manually segmented - 2 coders (the annotation scheme allows annotating mixtures of emotions) - many shaded and mixed emotions
JEMO
obtained with an emotion detection game played by 39 speakers - first prototype of a real-time detection system - automatically segmented - 2 coders - prototypical emotions: very few mixtures of emotions were annotated
Question of this paper: can we mix different kinds of corpora recorded in the same conditions to train a more efficient classifier?
Sub-corpus CINEMO (50 speakers)
            POS  SAD  ANG  NEU  TOTAL
#segments   313  364  344  510  1012
Sub-corpora of consensual segments were chosen for training the 4-class detection models; mixtures of emotions were not considered.
Sub-corpus JEMO (38 different speakers)
            POS  SAD  ANG  NEU  TOTAL
#segments   316  223  179  416  1062
We compared Corpus-Anger with Corpus-All on a set of acoustic features, plotting for each feature, across the three corpora, the relative distance

Distance = (Anger mean − All mean) / All mean
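The per-feature distance defined above can be computed in a few lines. A minimal sketch with NumPy, assuming each sub-corpus is a matrix with one row per segment and one column per acoustic feature (the function name is illustrative):

```python
import numpy as np

def anger_distance(features_anger: np.ndarray, features_all: np.ndarray) -> np.ndarray:
    """Relative deviation of the Anger sub-corpus mean from the
    whole-corpus mean, computed per acoustic feature (columns)."""
    anger_mean = features_anger.mean(axis=0)
    all_mean = features_all.mean(axis=0)
    return (anger_mean - all_mean) / all_mean
```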
Features: 1-5 rolloff (5%, 25%, 50%, 75%, 95%); 6 centroid; 7 spectral slope; 8 spectral origin; 9-13 band energies (0-250, 250-650, 0-650, 650-1k, 1k-4k); 14-37 bark bands 1-24; 38-50 MFCC 0-12; 51 ZCR; 52 mean loudness; 53 RMS intensity ×1000; 54 max/min F0 ratio; 55 var F0; 56 F2-F1; 57 F3-F2; 58-60 var F1, F2, F3; 61 voiced ratio; 62 local jitter; 63 local shimmer; 64 HNR
(Tahon & Devillers, Speech Prosody 2010)
Computed on voiced segments
Low-Level Descriptors (number of features computed with functionals): Energy (29), RMS Energy (22), F0 (23), Zero-Crossing Rate (18), MFCC 1-16 (366)
Functionals: moments (2); absolute mean, max; extremes (2): 2 x values, range; linear regression (2): MAE/MSE, slope; quartiles (2): quartile, interquartile
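Applying functionals means turning a variable-length frame-level LLD contour (e.g. an F0 or MFCC track) into a fixed-size vector of segment-level statistics. A minimal NumPy sketch covering the functional families listed above (the function name and exact statistics are illustrative, not the paper's exact feature set):

```python
import numpy as np

def functionals(contour: np.ndarray) -> dict:
    """Summarize a frame-level LLD contour with moments, extremes,
    quartiles, and linear-regression functionals."""
    t = np.arange(len(contour))
    slope, intercept = np.polyfit(t, contour, 1)       # linear regression
    pred = slope * t + intercept
    return {
        "mean": contour.mean(), "std": contour.std(),  # moments
        "min": contour.min(), "max": contour.max(),    # extremes
        "range": contour.max() - contour.min(),
        "q1": np.percentile(contour, 25),              # quartiles
        "q3": np.percentile(contour, 75),
        "slope": slope,                                # regression slope
        "mse": np.mean((contour - pred) ** 2),         # regression error
    }
```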
RR/UAR         Test CINEMO   Test JEMO
Train CINEMO   0.50/0.48     0.51/0.48
Train JEMO     0.43/0.39     0.60/0.55
• Training on CINEMO and testing on JEMO performs better than vice versa.
• It seems better to train on a wider set (CINEMO has more variability of emotional expressions, in different contexts) and test on a narrower one (JEMO contains more prototypical emotions) than the other way around.
• Surprisingly, training on CINEMO and testing on JEMO gives slightly better performance than testing on CINEMO itself.
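The two scores reported throughout are RR (recognition rate, i.e. overall accuracy) and UAR (unweighted average recall, the mean of per-class recalls, which is insensitive to class imbalance). They can be computed as follows (the function name is illustrative):

```python
import numpy as np

def rr_uar(y_true, y_pred):
    """Recognition rate (accuracy) and unweighted average recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rr = float(np.mean(y_true == y_pred))
    # Per-class recall, averaged with equal weight per class
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return rr, float(np.mean(recalls))
```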
RR/UAR         SpD CV (WEKA)   SpI CV
CINEMO         0.57/0.56       0.50/0.48
JEMO           0.63/0.59       0.60/0.55
CINEMO+JEMO    0.58/0.56       0.54/0.51

Be careful: WEKA uses speaker-dependent (SpD) cross-validation!
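The SpD/SpI distinction matters because plain k-fold cross-validation can place segments from the same speaker in both the training and the test fold, inflating scores. A speaker-independent split can be sketched with scikit-learn's GroupKFold (the data here is a toy stand-in for the acoustic feature matrix):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((12, 4))                 # 12 segments, 4 features (toy data)
y = rng.integers(0, 4, size=12)         # 4 emotion classes
speakers = np.repeat([0, 1, 2, 3], 3)   # 3 segments per speaker

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=speakers):
    # No speaker appears on both sides of a speaker-independent split
    assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])
```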
This means that the unification of the corpora improved the results.
We could not do better than JEMO on itself, but that good result is clearly due to JEMO being a small corpus with prototypical emotions only; it has poor generalization power (training on JEMO, testing on CINEMO: 0.43/0.39).
After balancing tests, we can also conclude that the performance improvement is mainly due to the larger number of instances.
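A balancing test of the kind mentioned above can be approximated by downsampling every class to the size of the smallest one, so that comparisons are not driven by class priors or sheer instance count. A minimal sketch (function name and data layout are illustrative):

```python
import random

def downsample(segments_by_class: dict, seed: int = 0) -> dict:
    """Downsample every class to the size of the smallest class,
    yielding a balanced subset for controlled comparisons."""
    rng = random.Random(seed)
    n = min(len(segs) for segs in segments_by_class.values())
    return {cls: rng.sample(segs, n) for cls, segs in segments_by_class.items()}
```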
RR/UAR   SpI CV, all features   SpI CV, SFFS
Female   0.59/0.55              0.65 (31 features)
Male     0.52/0.49              0.55 (38 features)
All      0.54/0.51
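SFFS here refers to sequential floating forward selection; a plain (non-floating) forward selection, which greedily adds the feature that most improves cross-validated performance, can be sketched with scikit-learn's SequentialFeatureSelector on toy data standing in for the acoustic feature matrix:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Toy data: 80 segments, 10 features; only features 0 and 3 carry the label
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Greedy forward selection down to 2 features, scored by 3-fold CV
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=3
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```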
          Positive   Sadness   Anger   Neutral
Male      252        262       267     432
Female    377        325       256     494
Unification of the two corpora (88 speakers) improves the results: the number of instances is approximately doubled, the classes are more balanced, and the two corpora enrich each other.
Splitting the corpus along gender is also beneficial: the models trained on the gender sub-corpora are better, and gender information was available in our affective avatar application.
Feature selection also seems beneficial (cross-corpora studies are needed).
Emotional databases are often small, sparse resources when collected in natural contexts (often less than 10% of utterances are emotional), making it difficult to build generic models from one corpus.
Find measures for qualifying emotional databases
Cross-corpora studies are very important
Use multiple corpora collected in different contexts to train models
Thanks for your attention