Annotation and Detection of Blended Emotions
in Real Human-Human Dialogs recorded in a Call Center
L. Vidrascu and L. Devillers
TLP-LIMSI/CNRS - France
IST AMITIES FP5 Project: Automated Multi-lingual Interaction with Information and Services
HUMAINE FP6 NoE: Human-Machine Interaction on Emotion
CHIL FP6 Project: Computers in the Human Interaction Loop
Introduction
- Study of real-life emotions to improve the capabilities of current speech technologies: detecting emotions can help orient the evolution of human-computer interaction via dynamic modification of dialog strategies
- Most previous work on emotion has been conducted on acted or induced data with archetypal emotions
- Results on artificial data transfer poorly to real data: the expression of emotion is complex (blended, shaded, masked), depends on contextual and social factors, and is expressed at many different levels (prosodic, lexical, etc.)
- Challenges for detecting emotions in real-life data: representation of complex emotion, robust annotation, validation protocol
Outline
- real-life corpus recorded in a call center (call centers are very interesting environments because recordings can be made unobtrusively)
- emotion annotation
- emotion detection
- blended emotions
- perspectives
Corpus
- Recorded at a Web-based Stock Exchange Customer Service Center
- Dialogs are real agent-client interactions in French, covering a range of investment topics, account management, and Web questions or problems
- 5229 speech turns, comprising 5012 in-task exchanges

# agents: 4                # clients: 100
# turns/dialog: average 50 (min 5, max 227)
# words/turn: average 9 (min 1, max 128)
# words total: 44.1k (3k distinct)
Outline
- real-life corpus description
- emotion annotation (this phase is complex):
  - definition of the emotion representation and of the emotional unit
  - annotation validation
- emotion detection
- blended emotions
- perspectives
Three types of emotion representation
- describing emotions via appraisal dimensions (Scherer, 1999): novelty, pleasantness, etc.
- describing emotions via abstract dimensions (Osgood, 1975): activation (active/passive), valence (negative/positive), control (relation to the stimulus)
- verbal categories: 8 primary universal emotions for Ekman (2002); primary vs. secondary/social (Plutchik, 1994)
Emotion Definition and Annotation
- We consider emotion in a broad sense, including attitudes as well as emotions
- Definition: set of 5 task-dependent emotion labels: the emotions Anger and Fear, plus the attitudes Excuse, Satisfaction, and Neutral
- Emotional unit: the speaker turn
- Dialog corpus labeled by listening to the audio, 2 independent annotators: ambiguities ~3%
        Anger  Fear  Excuse  Satisf.  Neutral
Client  9.9%   6.7%  0.1%    2.6%     80.7%
Agent   0.7%   1.3%  1.8%    4.0%     92.1%
Annotation Validation
- Inter-annotator agreement measure: Kappa = 0.8
- Perceptual test to validate the presence of emotions in the corpus. Test data: 40 speaker turns and 20 native French subjects; 75% of the negative emotions were correctly detected
Ref: Devillers, L., Vasilescu I., Mathon, C., (2003), “Acoustic cues for perceptual emotion detection in task-oriented Human-Human corpus”, 15th ICPhS, Barcelona
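Agreement of this kind can be computed directly; a minimal sketch using scikit-learn's Cohen's kappa (the label lists below are illustrative placeholders, not the corpus data):

```python
# Minimal sketch: Cohen's kappa between two annotators' turn-level labels.
# The label lists are illustrative placeholders, not the corpus data.
from sklearn.metrics import cohen_kappa_score

annotator1 = ["Anger", "Neutral", "Fear", "Neutral", "Anger"]
annotator2 = ["Anger", "Neutral", "Anger", "Neutral", "Anger"]
print(cohen_kappa_score(annotator1, annotator2))  # 1.0 = perfect agreement
```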
Outline
- real-life corpus description
- emotion annotation
- emotion detection:
  - prosodic, acoustic, and some disfluency cues
  - Neutral/Negative and Fear/Anger classification
- blended emotions
- perspectives
Prosodic, acoustic and disfluency cues
- Crucial point: selecting a set of relevant features. This is not well established and appears to be data-dependent, so we start from a large and redundant feature set (a feature-extraction sketch follows this list):
- F0 features: min, max, mean, standard deviation, range, slope, regression coefficient and its mean square error, cross-variation of F0 between two adjoining voiced segments
- Energy features: min, max, mean, standard deviation, range
- Duration features: speaking rate (inverse of the average length of the voiced parts of speech)
- Other acoustic features: first and second formants and their bandwidths
- Speech disfluency cues: number and length of silent pauses (unvoiced parts between 200 and 800 ms) and filler pauses ("euh")
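As a rough illustration of the feature set above, here is a minimal sketch using the parselmouth Python interface to Praat (the authors used Praat itself; the wrapper, settings, and file handling here are assumptions):

```python
# Sketch: per-turn F0 and energy statistics of the kind listed above.
import numpy as np
import parselmouth  # Python interface to Praat (assumed tooling)

def extract_features(wav_path):
    snd = parselmouth.Sound(wav_path)

    # F0 contour; Praat reports unvoiced frames as 0, so drop them.
    f0 = snd.to_pitch().selected_array['frequency']
    f0 = f0[f0 > 0]
    if f0.size == 0:
        return {}

    # Energy (intensity) contour in dB.
    en = snd.to_intensity().values.flatten()

    feats = {}
    for name, x in (('f0', f0), ('energy', en)):
        feats.update({f'{name}_min': float(x.min()),
                      f'{name}_max': float(x.max()),
                      f'{name}_mean': float(x.mean()),
                      f'{name}_std': float(x.std()),
                      f'{name}_range': float(x.max() - x.min())})

    # Regression coefficient (slope) of the F0 contour and the MSE of the fit.
    t = np.arange(f0.size)
    slope, intercept = np.polyfit(t, f0, 1)
    feats['f0_slope'] = float(slope)
    feats['f0_mse'] = float(np.mean((f0 - (slope * t + intercept)) ** 2))
    return feats
```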
Speech Data Processing
- F0, energy, and acoustic cue extraction with Praat. Example: F0 processing with z-score normalization (a sketch follows below)
- Since F0 detection is error-prone, voiced segments shorter than 30 ms are eliminated (1.4% of the segments, balanced across classes)
- Automatic alignment for filler and silent pause extraction: LIMSI system (HMMs with Gaussian mixtures for acoustic modeling); the word alignment was manually verified for the speaker turns labeled with negative emotions
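A minimal sketch of the two F0 processing steps just mentioned, assuming F0 values come grouped per voiced segment (the data layout and per-speaker scope of the normalization are assumptions):

```python
import numpy as np

# Sketch: drop voiced segments under 30 ms, then z-score normalize the
# remaining F0 values (e.g. per speaker).
def normalize_f0(f0_segments, min_dur=0.030):
    """f0_segments: list of (duration_in_seconds, f0_value_array) pairs."""
    kept = [np.asarray(f0) for dur, f0 in f0_segments if dur >= min_dur]
    all_f0 = np.concatenate(kept)
    mu, sd = all_f0.mean(), all_f0.std()
    return [(f0 - mu) / sd for f0 in kept]
```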
Feature selection and detection systems
- Weka toolkit (www.cs.waikato.ac.nz): a collection of machine learning algorithms for data mining, used to select subsets of the best attributes
- Attribute selection methods tested: SVM-based selection, entropy measure (InfoGain), Correlation-based Feature Selection
- Classifiers tested: decision tree with pruning (C4.5), Support Vector Machine (SVM), voting algorithms (ADTree and AdaBoost) that combine the outputs of different models (a rough code analogue follows this list)
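The authors worked in Weka; as a rough scikit-learn analogue (a sketch, not their setup), an InfoGain-style attribute ranking can be chained with the kinds of classifiers they tested:

```python
# Sketch: select the k best attributes by mutual information (an
# InfoGain-style criterion), then compare a pruned tree and an SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 30)        # placeholder feature matrix
y = np.random.randint(0, 2, 200)   # placeholder Neutral/Negative labels

for clf in (DecisionTreeClassifier(max_depth=5), SVC()):
    model = make_pipeline(SelectKBest(mutual_info_classif, k=5), clf)
    print(type(clf).__name__, cross_val_score(model, X, y, cv=10).mean())
```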
Neutral/Negative emotion detection
Using prosodic and acoustic cues; jackknifing procedure (30 runs). Detection rates in % (standard deviation in parentheses):
          C4.5        AdaBoost    ADTree      SVM
5 att     72.8 (5.2)  71.2 (4.5)  72.3 (4.6)  67.2 (6.3)
10 att    73.0 (5.3)  71.5 (4.8)  73.0 (5.7)  69.5 (5.6)
15 att    71.7 (6.4)  71.1 (4.7)  71.6 (4.9)  70.8 (4.9)
20 att    71.8 (5.3)  71.3 (4.3)  71.8 (5.1)  71.0 (4.9)
all att   69.4 (5.6)  71.7 (4.3)  71.6 (4.8)  69.6 (3.5)
- Very few attributes (5 att) already yield a high level of detection
- Little difference between the techniques
Anger/Fear emotion detection
Decision tree classifier:
- 56% correct detection with prosodic and acoustic cues
- 60% when adding disfluency cues (silent pauses and filler pauses "euh")
We hypothesize that this low performance is due to blended emotions.
Outline
- real-life corpus description
- emotion annotation
- emotion detection
- blended emotions: in certain states of mind, it is possible to exhibit more than one emotion, e.g. when trying to mask a feeling, under conflicting emotions, when suffering, etc.
- perspectives
Blended emotions
- In this financial task, Anger and Fear can be combined: clients can be angry because they are afraid of losing money
- Confusion matrix (40% confusion): there are as many "Anger classified as Fear" errors as "Fear classified as Anger" errors
- Re-annotation procedure for the negative emotions with a new scheme defined for other tasks (medical call center, EmoTV), using 2 different annotators
New emotion annotation scheme
- Allows choosing 2 labels per segment:
  - Major emotion: the emotion perceived as dominant
  - Minor emotion: if another emotion is perceived in the background (the most intense minor emotion)
- 7 coarse classes (defined for another task): Fear, Sadness, Anger, Hurt, Positive, Surprise, Neutral attitude
Perception of emotion is very subjective. How to mix different annotations?
Labeler 1: Major Anger, Minor Sadness
Labeler 2: Major Fear, Minor Anger
Exploit the differences by combining the labels from multiple annotators in a soft emotion vector (a sketch of this combination follows below):
-> ((wM+wm)/W Anger, wM/W Fear, wm/W Sadness)
For wM = 2, wm = 1, W = 6 in this example:
-> (3/6 Anger, 2/6 Fear, 1/6 Sadness)
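A minimal sketch of this combination rule (the function name is illustrative; the wM=2, wm=1 weighting follows the example above):

```python
# Sketch: build a soft emotion vector from several annotators'
# (Major, Minor) label pairs, weighting Major labels by 2 and Minor by 1.
from collections import Counter

def soft_emotion_vector(annotations, w_major=2, w_minor=1):
    """annotations: list of (major_label, minor_label_or_None) per labeler."""
    scores, total = Counter(), 0
    for major, minor in annotations:
        scores[major] += w_major
        total += w_major
        if minor is not None:
            scores[minor] += w_minor
            total += w_minor
    return {label: w / total for label, w in scores.items()}

# Labeler 1: Major Anger, Minor Sadness; Labeler 2: Major Fear, Minor Anger
print(soft_emotion_vector([("Anger", "Sadness"), ("Fear", "Anger")]))
# -> {'Anger': 0.5, 'Sadness': 0.1667, 'Fear': 0.3333}
```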
Re-annotation result
Because we focus on the Anger and Fear emotions, 4 classes were deduced from the emotion vectors (the deduction rule is sketched below):
- Fear (Fear > 0; Anger = 0)
- Anger (Fear = 0; Anger > 0)
- Blended emotion (Fear > 0; Anger > 0)
- Other (Fear = 0; Anger = 0)
Consistency between the first and the second annotation for 78% of the utterances (e.g. if Anger >= Fear in the vector and the previous annotation was Anger, the two are consistent)
Same Major label for 64% of the utterances; no common labels between the two annotators for 13%
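The class deduction is a simple rule over the vector; a minimal sketch:

```python
# Sketch: deduce the 4 classes from a soft emotion vector,
# e.g. as produced by soft_emotion_vector() above.
def coarse_class(vec):
    fear, anger = vec.get("Fear", 0.0), vec.get("Anger", 0.0)
    if fear > 0 and anger > 0:
        return "Blended"
    if fear > 0:
        return "Fear"
    if anger > 0:
        return "Anger"
    return "Other"

print(coarse_class({"Anger": 0.5, "Fear": 1/3, "Sadness": 1/6}))  # Blended
```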
Re-annotation results
Validation of the presence of mixtures of emotion in the Anger and Fear segments. Excerpt taken from a call:
Client: "No, but I haven't handled it at all. I was on holiday, I got a letter, about 4... 400 euros were missing..."
[Figure: "Reannotation of Fear and Anger" - segments labeled Fear or Anger under the 1st annotation scheme, redistributed over the Other / Fear / Blend / Anger classes; y-axis: % of segments (0-70)]
Summary and perspectives
Detection performance:
- 73% correct detection between Neutral and Negative emotions, but only 60% between Fear and Anger
- Validation of the presence of mixtures of the Fear/Anger emotions
Emotion representation: soft emotion vector, also used for
- a medical call center corpus (20h annotated)
- a multimodal corpus of TV interviews (EmoTV-HUMAINE)
Perspectives:
- improve detection performance by using the non-complex part of the corpus for model training
- analyze real-life blended emotions and run a perceptual test on blended emotions
Thank you for your attention
Reference: L. Devillers, L. Vidrascu, L. Lamel, "Challenges in real-life emotion annotation and machine learning based detection", Neural Networks (special issue), to appear July 2005.
Combining lexical and paralinguistic scores
[Figure: "Combining Lexical and Paralinguistic scores" - detection rate (60-80%) over 10 test sets, with curves for Lexical, Paralinguistic, and Lexical + Paralinguistic]
- lexical unigram model: 78% Neutral/Negative detection
- linear combination of the two scores, evaluated on 10 test sets (50 utterances); see the sketch below
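A minimal sketch of such a score-level linear combination (the weight alpha is an assumption, to be tuned on held-out data):

```python
# Sketch: linear combination of the lexical and paralinguistic scores.
def combine(lexical_score, paralinguistic_score, alpha=0.5):
    return alpha * lexical_score + (1 - alpha) * paralinguistic_score
```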
Emotion Detection Model
- The emotion detection model is based on unigram models
- Due to the sparseness of the on-emotion data, each emotion model is an interpolation of an emotion-specific model and a general task-specific model estimated on the entire training corpus
- The similarity between an utterance u and an emotion E is the normalized log-likelihood ratio between the emotion model and the general model
- Standard preprocessing procedures: compounding (negative forms, e.g. "pas_normal"), stemming, and stopping
L(u, E) = \frac{1}{|u|} \sum_{w \in u} tf(w, u) \, \log \frac{P(w \mid E)}{P(w)}

where tf(w, u) is the frequency of word w in utterance u, |u| the length of u, P(w|E) the interpolated emotion model, and P(w) the general model.
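A minimal sketch of this interpolated unigram scoring (the smoothing weight and data structures are assumptions, not the authors' exact setup):

```python
# Sketch: train unigram models and score an utterance with the
# length-normalized log-likelihood ratio L(u, E) defined above.
import math
from collections import Counter

def train_unigram(texts):
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(utterance, emotion_model, general_model, lam=0.7):
    tf = Counter(utterance.split())
    n = sum(tf.values())
    s = 0.0
    for w, freq in tf.items():
        p_gen = general_model.get(w, 1e-9)
        # Interpolate the sparse emotion model with the general model.
        p_emo = lam * emotion_model.get(w, 0.0) + (1 - lam) * p_gen
        s += freq * math.log(p_emo / p_gen)
    return s / max(n, 1)
```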
Experiments on Anger/Fear detection
- Prosodic and acoustic cues: 56% detection, around 60% when disfluencies are added
- Lexical cues (ICME 2003): often the same lexical words (problem, abnormal, etc.); the difference is much more syntactic than lexical