emotional speech detection laurence devillers, limsi-cnrs, [email protected]
Emotional speech detection: Laurence Devillers, LIMSI-CNRS, [email protected]. Expression of emotions in speech synthesis: Marc Schröder, DFKI, [email protected]. Humaine Plenary Meeting, 4-6 June 2007, Paris.
L-Devillers - Plenary, 5 June 2007
Emotional Speech
Emotional speech detection: Laurence Devillers, LIMSI-CNRS, [email protected]
Expression of emotions in speech synthesis: Marc Schröder, DFKI, [email protected]
Humaine Plenary Meeting, 4-6 June 2007, Paris
Overview
Challenge: a real-time system for “real-life” emotional speech detection, in order to build an affectively competent agent
Emotion is considered in the broad sense
Real-life emotions are often shaded, blended, or masked due to social aspects
State of the art
• Static emotion detection system (emotional unit level: word, chunk, sentence)
• Statistical approach (such as SVM) using a large amount of data to train models
• 4-6 emotions detected, rarely more
[Diagram components: feature extraction; emotion models E; emotion detection P(E_i | O), where O is the observation]
The scheme shows the components of an automatic emotion recognition system.
Performance on realistic data (CEICES): 2 emotions > 80%, 4 emotions > 60%
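The detection step P(E_i | O) can be illustrated with a minimal sketch. It assumes each emotion model supplies a likelihood P(O | E_i) and a prior P(E_i); the numbers and the function names `posteriors` and `detect` are hypothetical and only illustrate the decision rule, not the SVM-based system described on the slides.

```python
def posteriors(likelihoods, priors):
    """Return P(E_i | O) for each emotion using Bayes' rule."""
    joint = {e: likelihoods[e] * priors[e] for e in likelihoods}
    total = sum(joint.values())
    return {e: p / total for e, p in joint.items()}


def detect(likelihoods, priors):
    """Pick the emotion with the highest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values for one observation O (not taken from the slides).
likelihoods = {"Anger": 0.02, "Fear": 0.05, "Neutral": 0.01}  # P(O | E_i)
priors = {"Anger": 0.2, "Fear": 0.1, "Neutral": 0.7}          # P(E_i)

post = posteriors(likelihoods, priors)
best = detect(likelihoods, priors)
print(best, round(post[best], 4))
```

With these made-up values the prior outweighs the likelihood, so the most frequent class wins even though another class explains the observation better.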
Automatic emotion detection
The difficulty of the detection task increases with the variability of the emotional speech expression.
4 dimensions:
• Speaker (dependent/independent, age, gender, health)
• Environment (transmission channel, noise environment)
• Number and type of emotions (primary, secondary)
• Acted/real-life data and application context
Automatic emotion detection: research evolution
[Timeline figure: 1996, 2003, 2007, along four axes]
Speakers: speaker-dependent, pluri-speaker; speaker-independent: adaptation to gender; with adaptation; personality, health, age, culture
Emotion representation: primary acted emotions; positive/negative emotions (HMI); emotion/unemotion (WoZ); > 4 acted emotions; 2-5 realistic emotions (children, CEICES), HMI; real-life call-center emotions; emotion in interaction; > 5 real emotions
Acted/WoZ/real-life data: actors; WoZ; HMI; call center data; TV clips, documentaries, news, fiction
Environment/transmission: quiet room, channel-dependent; phone; channel-independent; public place; voice superposition
Challenge with spontaneous emotions
• Authenticity is present but there is no control over the emotion
• Need to find appropriate labels and measures for annotation validation
• Blended emotions (Scherer: Geneva Airport Lost Luggage Study)
Annotation and validation of annotation
• Expert annotation phase by several coders (10 coders; CEICES: 5 coders; often only two)
• Control of the quality of annotations:
  • Intra/inter-annotator agreement
  • Perception tests
    • Validate the annotation scheme and the annotations: perception of emotion mixtures (40 subjects), NEG/POS valence; importance of the context
    • Give a measure for comparing human perception with automatic detection
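Inter-annotator agreement can be quantified in several ways; the slides do not name a specific measure. As one common choice, here is a minimal sketch of Cohen's kappa for two coders, which corrects raw agreement for agreement expected by chance. The label sequences and the function name `cohen_kappa` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(labels_a)
    # Observed agreement: fraction of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six segments by two coders.
coder_1 = ["Fear", "Anger", "Neutral", "Neutral", "Fear", "Anger"]
coder_2 = ["Fear", "Anger", "Neutral", "Fear", "Fear", "Neutral"]

kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

Kappa only compares two coders on one label per segment; with more coders or with major/minor labels, another agreement measure would be needed.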
Human-Human Real-life Corpora
LIMSI corpus | Audio / audio-visual source | Size | #Speakers | Emotion classes
Stock exchange | Call center, French | 4.5h | 100 callers / 4 agents | Anger, Fear, Satisfaction, Excuse
Financial loan | Call center, French | 2h | 250 callers / 2 agents | Anger, Fear, Satisfaction, Excuse
Medical | Call center, French | 20h | 784 callers / 7 agents | 20 classes, 7 broad classes
EmoTaboo (emotion induction) | Game, French | 7h30 | 10 speakers | > 20 classes
EmoTV | TV news, French | < 1h | 100 speakers | 14-35 classes, 7 macro
SAFE (actors) | Movies | 7h | 400 speakers | Fear, other Neg, Pos.
Context-dependent emotion labels
Do the labels represent the emotion of a considered task or context?
Example: real-life emotion studies (call center). The Fear label represents different expressions of fear due to different contexts: callers' fear of losing money, callers' fear for their life, agents' fear of making a mistake.
The difference is not just a question of intensity/activation
-> Primary/secondary fear?
-> Degree of urgency/reality of the threat?
Fear in fiction (movies): study of many different contexts
How to generalize? Should we define labels as a function of the type of context?
We just defined the social role (agent/caller) as a context
See poster by C. Clavel
Emotional labels
• The majority of detection systems use a discrete representation of emotion
• This requires a sufficient amount of data. To that end, we use a hierarchical organization of labels (LIMSI example)
Coarse level (8 classes) | Fine-grained level (20 classes + Neutral)
Fear | Fear, Anxiety, Stress, Panic, Embarrassment
Anger | Annoyance, Impatience, ColdAnger, HotAnger
Sadness | Sadness, Dismay, Disappointment, Resignation, Despair
Hurt | Hurt
Surprise | Surprise
Relief | Relief
Interest | Interest, Compassion
Other Positive | Amusement
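The label hierarchy above can be written as a lookup table that maps each fine-grained label to its coarse class, which is how sparse fine-grained annotations can be pooled into classes with enough data. This is a minimal sketch: the class lists come from the table, while the names `COARSE_TO_FINE`, `FINE_TO_COARSE` and `to_coarse`, and the choice to map Neutral to itself, are assumptions made for illustration.

```python
# Coarse class -> fine-grained labels, as listed in the LIMSI example.
COARSE_TO_FINE = {
    "Fear": ["Fear", "Anxiety", "Stress", "Panic", "Embarrassment"],
    "Anger": ["Annoyance", "Impatience", "ColdAnger", "HotAnger"],
    "Sadness": ["Sadness", "Dismay", "Disappointment", "Resignation", "Despair"],
    "Hurt": ["Hurt"],
    "Surprise": ["Surprise"],
    "Relief": ["Relief"],
    "Interest": ["Interest", "Compassion"],
    "Other Positive": ["Amusement"],
}

# Inverted index: fine-grained label -> coarse class.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}


def to_coarse(label):
    """Map a fine-grained label to its coarse class (Neutral stays Neutral)."""
    if label == "Neutral":
        return "Neutral"  # assumption: Neutral sits outside the 8 coarse classes
    return FINE_TO_COARSE[label]


print(len(FINE_TO_COARSE), to_coarse("Panic"), to_coarse("Annoyance"))
```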
No bad coders but different perceptions
Combining annotations of different coders: a soft vector of emotions
Labeler 1: (Major) Annoyance, (Minor) Interest
Labeler 2: (Major) Stress, (Minor) Annoyance
((wM + wm)/W Annoyance, wM/W Stress, wm/W Interest)
For wM=2, wm=1, W=6: (0.5 Annoyance, 0.33 Stress, 0.17 Interest).
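The numeric example above (0.5 Annoyance, 0.33 Stress, 0.17 Interest) can be reproduced with a short sketch. It assumes that each coder gives one major and one optional minor label, that the weights are summed per label, and that W is the total of all weights; the function name `soft_vector` is hypothetical.

```python
def soft_vector(annotations, w_major=2, w_minor=1):
    """Combine (major, minor) labels from several coders into a soft vector."""
    weights = {}
    for major, minor in annotations:
        weights[major] = weights.get(major, 0) + w_major
        if minor is not None:
            weights[minor] = weights.get(minor, 0) + w_minor
    total = sum(weights.values())  # W: sum of all weights over all coders
    return {label: w / total for label, w in weights.items()}


# The two labelers from the slide: (major label, minor label).
annotations = [("Annoyance", "Interest"), ("Stress", "Annoyance")]

vector = soft_vector(annotations)
print({label: round(value, 2) for label, value in vector.items()})
```

Annoyance collects a major weight from the first labeler and a minor weight from the second, which is why it reaches 3/6.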
Speech data processing
[Processing diagram]
Praat: ~200 cues
• Prosodic: F0, formants, energy
• Micro-prosody: jitter, shimmer, …
• Disfluencies
• Affect bursts
Transcription (words): preprocessing (stemming), N-gram model
WEKA toolkit (www.cs.waikato.ac.nz, Witten & Frank, 1999): attribute selection, SVM, ..
Combination
LIMSI: see poster by L. Vidrascu
Standard features
• Pitch level, range
• Energy level, range
• Speaking rate
• Spectral features (formants, MFCCs)
Less standard
• Voice quality: local disturbances (jitter/shimmer)
• Disfluencies (pauses, filled pauses)
• Affect bursts
We need to automatically detect affect bursts and to add new features such as voice quality features
Phone signal is not of sufficient quality for many existing techniques
See poster by Ni Chasaide
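Jitter and shimmer, listed above as local disturbances, are commonly computed as the mean absolute difference between consecutive glottal periods (jitter) or peak amplitudes (shimmer), divided by the mean value. The sketch below assumes that "local" definition and that the periods and amplitudes have already been extracted from the signal; the input values and the function name `local_perturbation` are hypothetical, and this is not the exact extraction used in the LIMSI system.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two consecutive cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical measurements for four consecutive voiced cycles.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]  # glottal period durations (seconds)
amplitudes = [0.80, 0.78, 0.82, 0.79]         # peak amplitude per cycle

jitter = local_perturbation(periods_s)    # cycle-to-cycle period variation
shimmer = local_perturbation(amplitudes)  # cycle-to-cycle amplitude variation
print(f"jitter={jitter:.4f} shimmer={shimmer:.4f}")
```

Both measures depend on reliable cycle detection, which is one reason low-quality phone signals are hard for such voice quality features.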
[Bar chart: % of good detection (y-axis from 50 to 90) for the class sets Fe/N, Fe/Sd, Ag/N, Ax/St, Fe/Ag, Sd/N, Fe/Ag/N, Fe/Sd/N, Fe/Ag/Sd/Re, Fe/Ag/Sd/Re/N]
Fe: fear, Sd: sadness, Ag: anger, Ax: anxiety, St: stress, Re: relief, N: neutral
LIMSI: Results with paralinguistic cues (SVMs): from 2 to 5 emotion classes (% of good detection)
25 best features for 5-emotion detection
Feature type | # of cues among the 25 best
F0 related | 4
Energy | 5
Microprosody | 4
Formants | 2
Duration from phonetic alignment | 4
Other cues from transcription | 6
The media channel (phone/microphone), the type of data (adult vs. children, realistic vs. naturalistic) and the emotion classes all affect which set of features is most relevant.
Out of our 5 classes, Sadness is the least recognized without mixing the cues.
Features from all the classes were selected (different from one class to another)
The 5 classes: Anger, Fear, Sadness, Relief, Neutral state
Real-life emotional system
System based on acted data -> inadequate for real-life data detection (Batliner)
GEMEP/CEMO comparison: different emotions. First experiments show an acceptable detection score only for Anger.
Real-life emotion studies are necessary
Detection results on call center data: state of the art for « realistic emotions »
> 80% 2 emotions, > 60% 4 emotions, ~55% 5 emotions
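To put these detection rates in perspective, they can be compared with a chance level. The sketch below assumes balanced classes, so that chance is 100/N percent for N emotions; the slides do not state the class distribution, and the rates for 2 and 4 emotions are lower bounds, so the margins are only indicative.

```python
# Detection rates quoted on the slide, in percent (2 and 4 are lower bounds).
reported = {2: 80.0, 4: 60.0, 5: 55.0}


def chance_level(n_classes):
    """Chance accuracy in percent, assuming balanced classes (an assumption)."""
    return 100.0 / n_classes


# Margin of the reported rate over the chance level, per number of emotions.
margins = {n: score - chance_level(n) for n, score in reported.items()}

for n, margin in margins.items():
    print(f"{n} emotions: chance {chance_level(n):.0f}%, margin {margin:.0f} points")
```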
Challenges ahead
Short-term:
• Acceptable solutions for targeted applications are within reach
• Use a dynamic model of emotion for real-time emotion detection (history memory)
• New features: automatically extracted information on voice quality, affect bursts and disfluencies from the signal, which does not require exact speech recognition
• Detect relaxed/tense voice (Scherer)
• Add contextual knowledge to the blind statistical model: social role, type of action, regulation (adapting emotional expression to strategic interaction goals; face theory, Goffman)
Long-term:
• Dynamic emotion process based on an appraisal model
• Combining information at several levels: acoustic/linguistic, multimodal cues, adding contextual information (social role)
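The short-term item on a dynamic model with history memory is not specified further in the slides. One simple reading, shown here purely as an assumption, is to smooth the per-turn emotion scores with an exponential moving average, so that the current estimate keeps a memory of earlier turns. The function name `smooth`, the parameter `alpha` and the per-turn scores are all hypothetical.

```python
def smooth(history, current, alpha=0.5):
    """Blend the current emotion scores with the remembered history.

    alpha close to 1 trusts the current turn; close to 0 trusts the history.
    """
    if not history:
        return dict(current)  # first turn: nothing to remember yet
    labels = set(history) | set(current)
    return {
        label: alpha * current.get(label, 0.0) + (1.0 - alpha) * history.get(label, 0.0)
        for label in labels
    }


# Hypothetical per-turn detector outputs (scores sum to 1 within a turn).
turns = [
    {"Neutral": 0.8, "Anger": 0.2},
    {"Neutral": 0.4, "Anger": 0.6},
    {"Anger": 0.9, "Neutral": 0.1},
]

state = {}
for turn in turns:
    state = smooth(state, turn, alpha=0.5)

print({label: round(score, 3) for label, score in sorted(state.items())})
```

A single isolated turn cannot flip the estimate on its own; a label has to persist over several turns before it dominates.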
Demo (coffee break…)
Thanks