Emotional Speech detection
Laurence Devillers, LIMSI-CNRS, devil@limsi.fr


TRANSCRIPT

Page 1

Emotional Speech detection
Laurence Devillers, LIMSI-CNRS, devil@limsi.fr

Expression of emotions in speech synthesis
Marc Schröder, DFKI, [email protected]

Humaine Plenary Meeting, 4-6 June 2007, Paris

Page 2

Overview

Challenge: a real-time system for "real-life" emotional speech detection, in order to build an affectively competent agent.

• Emotion is considered in the broad sense.
• Real-life emotions are often shaded, blended or masked emotions, due to social aspects.

Page 3

State of the art

• Static emotion detection systems (emotional unit level: word, chunk, sentence)
• Statistical approaches (such as SVM) using large amounts of data to train models
• 4-6 emotions detected, rarely more

[Diagram: components of an automatic emotion recognition system: feature extraction from the observation O, followed by emotion detection, i.e. choosing among the emotion models Ei the one that maximizes P(Ei | O).]

Performance on realistic data (CEICES): 2 emotions > 80%, 4 emotions > 60%.
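As a minimal illustration of the decision step in this scheme, the Python sketch below picks the emotion Ei with the highest posterior P(Ei | O) returned by a trained classifier. The class names and scores are invented for illustration and are not taken from the deck.

# Minimal sketch (illustrative, not from the deck): the decision step of the
# scheme above, choosing the emotion Ei that maximizes P(Ei | O).

def detect_emotion(posteriors):
    """Return the emotion class with the highest posterior P(Ei | O)."""
    return max(posteriors, key=posteriors.get)

# Posteriors such as an SVM with probability estimates might return for one
# emotional unit (word, chunk or sentence):
scores = {"Anger": 0.55, "Fear": 0.25, "Neutral": 0.15, "Relief": 0.05}
print(detect_emotion(scores))  # -> Anger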

Page 4

Automatic emotion detection

The difficulty of the detection task increases with the variability of the emotional speech expression.

4 dimensions:
• Speaker (dependent/independent, age, gender, health)
• Environment (transmission channel, noise environment)
• Number and type of emotions (primary, secondary)
• Acted/real-life data and application context

Page 5

Automatic emotion detection: research evolution

[Timeline figure, 1996 → 2003 → 2007, tracking progress along four dimensions:]
• Speakers: from speaker-dependent and pluri-speaker systems towards speaker-independent systems with adaptation (e.g. to gender), and eventually to personality, health, age and culture.
• Emotion representation: from primary acted emotions, Emotion/Non-emotion (WoZ) and Positive/Negative emotions (HMI), towards 2-5 realistic emotions (children, CEICES), real-life call-center emotions, >4 acted emotions, >5 real emotions and emotion in interaction.
• Acted/WoZ/real-life data: from actors, documentaries, news, fiction and TV clips towards WoZ, HMI and call-center data.
• Environment/transmission: from quiet rooms and channel-dependent conditions towards phone speech, public places, channel-independent conditions and superposed voices.

Page 6

Challenge with spontaneous emotions

• Authenticity is present, but there is no control on the emotion
• Need to find appropriate labels and measures for annotation validation
• Blended emotions (Scherer: Geneva Airport Lost Luggage Study)

Annotation and validation of the annotation
• Expert annotation phase by several coders (10 coders; CEICES: 5 coders; often only two)
• Control of the quality of the annotations:
  • Intra-/inter-annotator agreement (see the sketch below)
  • Perception tests
• Validate the annotation scheme and the annotations: perception of emotion mixtures (40 subjects), NEG/POS valence, importance of the context
• Give a measure for comparing human perception with automatic detection
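One common way to quantify inter-coder agreement is Cohen's kappa; the deck does not name a specific measure, so this choice is an assumption, as are the scikit-learn tool and the invented label sequences in the sketch below.

# Hedged sketch: Cohen's kappa as an inter-coder agreement measure.
# scikit-learn is an assumed tool and the two label sequences are invented.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["Anger", "Anger", "Fear", "Neutral", "Relief", "Anger"]
coder_2 = ["Anger", "Fear", "Fear", "Neutral", "Neutral", "Anger"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa between the two coders: {kappa:.2f}")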

Page 7

Human-Human real-life corpora (LIMSI)

Corpus | Type | Audio/Audio-visual | Size | #Speakers | Emotion classes
Stock exchange | Call center, French | Audio | 4.5h | 100 callers / 4 agents | Anger, Fear, Satisfaction, Excuse
Financial loan | Call center, French | Audio | 2h | 250 callers / 2 agents | Anger, Fear, Satisfaction, Excuse
Medical | Call center, French | Audio | 20h | 784 callers / 7 agents | 20 classes, 7 broad classes
EmoTaboo | Emotion induction game, French | Audio-visual | 7h30 | 10 speakers | > 20 classes
EmoTV | TV news, French | Audio-visual | < 1h | 100 speakers | 14-35 classes, 7 macro-classes
SAFE (actors) | Movies | Audio-visual | 7h | 400 speakers | Fear, other Negative, Positive

Page 8

Context-dependent emotion labels

Do the labels represent the emotion of a considered task or context?

Example: real-life emotion studies (call center). The Fear label covers different expressions of Fear arising from different contexts: callers' fear of losing money, callers' fear for their life, agents' fear of making a mistake.

The difference is not just a question of intensity/activation:
-> Primary/secondary fear?
-> Degree of urgency/reality of the threat?

Fear in fiction (movies): study of many different contexts.

How to generalize? Should we define labels as a function of the type of context? We just defined the social role (agent/caller) as a context.

See poster of C. Clavel.

Page 9

Emotional labels

• The majority of detection systems use a discrete emotion representation.
• This requires a sufficient amount of data per class; to that end, we use a hierarchical organization of labels (LIMSI example). A code sketch of the mapping follows the table below.

Coarse level (8 classes) | Fine-grained level (20 classes + Neutral)
Fear | Fear, Anxiety, Stress, Panic, Embarrassment
Anger | Annoyance, Impatience, ColdAnger, HotAnger
Sadness | Sadness, Dismay, Disappointment, Resignation, Despair
Hurt | Hurt
Surprise | Surprise
Relief | Relief
Interest | Interest, Compassion
Other Positive | Amusement
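A minimal sketch of this hierarchy as a lookup table in Python; the label spellings follow the table above, and only the representation (a plain dictionary) is an assumption.

# Sketch of the hierarchical label organization above: each of the 20
# fine-grained labels (plus Neutral) collapses onto one of the 8 coarse
# classes, so that sparse fine-grained classes can be pooled when needed.
FINE_TO_COARSE = {
    "Fear": "Fear", "Anxiety": "Fear", "Stress": "Fear",
    "Panic": "Fear", "Embarrassment": "Fear",
    "Annoyance": "Anger", "Impatience": "Anger",
    "ColdAnger": "Anger", "HotAnger": "Anger",
    "Sadness": "Sadness", "Dismay": "Sadness", "Disappointment": "Sadness",
    "Resignation": "Sadness", "Despair": "Sadness",
    "Hurt": "Hurt",
    "Surprise": "Surprise",
    "Relief": "Relief",
    "Interest": "Interest", "Compassion": "Interest",
    "Amusement": "Other Positive",
    "Neutral": "Neutral",
}

def coarse_label(fine_label):
    """Map a fine-grained annotation onto the coarse class level."""
    return FINE_TO_COARSE[fine_label]

print(coarse_label("Impatience"))  # -> Anger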

Page 10

No bad coders, but different perceptions

Combining the annotations of different coders into a soft vector of emotions:

Labeler 1: (Major) Annoyance, (Minor) Interest
Labeler 2: (Major) Stress, (Minor) Annoyance

Soft vector: ((wM+wm)/W Annoyance, wM/W Stress, wm/W Interest)

For wM = 2, wm = 1, W = 6: (0.5 Annoyance, 0.33 Stress, 0.17 Interest)
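A short worked sketch of this combination in Python, reproducing the numbers above; the list/dictionary representation is an assumption for illustration.

# Worked sketch of the soft-vector combination above: each coder's Major
# label receives weight wM and each Minor label weight wm; the vector is
# normalized by W, the total weight contributed by all coders.
from collections import defaultdict

W_MAJOR, W_MINOR = 2, 1

annotations = [                      # (Major, Minor) per coder, as on the slide
    ("Annoyance", "Interest"),       # Labeler 1
    ("Stress", "Annoyance"),         # Labeler 2
]

weights = defaultdict(float)
for major, minor in annotations:
    weights[major] += W_MAJOR
    weights[minor] += W_MINOR

W = sum(weights.values())            # W = 6 for two coders with wM=2, wm=1
soft_vector = {label: round(w / W, 2) for label, w in weights.items()}
print(soft_vector)                   # {'Annoyance': 0.5, 'Interest': 0.17, 'Stress': 0.33}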

Page 11

Speech data processing

[Pipeline diagram: the speech signal is analysed with Praat to extract ~200 cues (prosodic: F0, formants, energy; micro-prosody: jitter, shimmer, ...; disfluencies; affect bursts), while the transcription (words) goes through preprocessing (stemming) into n-gram models; the two streams are combined and fed to the WEKA toolkit (www.cs.waikato.ac.nz; Witten & Frank, 1999) for attribute selection and classification (SVM, ...).]

LIMSI (see poster of L. Vidrascu):
• Standard features: pitch level and range, energy level and range, speaking rate, spectral features (formants, MFCCs)
• Less standard features: voice quality (local disturbances: jitter/shimmer), disfluencies (pauses, filled pauses), affect bursts

We need to automatically detect affect bursts and to add new features such as voice quality features. The phone signal is not of sufficient quality for many existing techniques (see Ni Chasaide poster). A sketch of such a cue-extraction and classification pipeline follows below.
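The deck's own pipeline relies on Praat for cue extraction and on WEKA for attribute selection and SVM classification. The sketch below is only an assumed stand-in: it uses the parselmouth (Praat) and scikit-learn Python packages, extracts just a few of the "standard" cues (pitch and energy level and range), and the file names and labels are hypothetical placeholders.

# Hedged sketch of a cue-extraction + SVM pipeline. parselmouth (a Python
# wrapper around Praat) and scikit-learn stand in for the Praat + WEKA tools
# used in the deck; wav file names and labels are hypothetical placeholders.
import numpy as np
import parselmouth
from sklearn.svm import SVC

def prosodic_features(wav_path):
    """A few 'standard' cues: pitch level/range and energy level/range."""
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    energy = snd.to_intensity().values.flatten()
    return np.array([f0.mean(), f0.max() - f0.min(),
                     energy.mean(), energy.max() - energy.min()])

# One labeled wav file per emotional unit (placeholders):
train_files = ["unit_001.wav", "unit_002.wav", "unit_003.wav", "unit_004.wav"]
train_labels = ["Anger", "Neutral", "Fear", "Neutral"]

X = np.vstack([prosodic_features(f) for f in train_files])
clf = SVC(kernel="rbf").fit(X, train_labels)

# Predicted emotion class for a new unit:
print(clf.predict(prosodic_features("unit_005.wav").reshape(1, -1)))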

Page 12

[Bar chart: percentage of correct detection (roughly 50-90%) for different combinations of classes, from 2 to 5 emotions: Fe/N, Fe/Sd, Ag/N, Ax/St, Fe/Ag, Sd/N, Fe/Ag/N, Fe/Sd/N, Fe/Ag/Sd/Re, Fe/Ag/Sd/Re/N. Fe: fear, Sd: sadness, Ag: anger, Ax: anxiety, St: stress, Re: relief, N: neutral.]

LIMSI: results with paralinguistic cues (SVMs), from 2 to 5 emotion classes (% of correct detection).

Page 13

25 best features for 5-emotion detection (Anger, Fear, Sadness, Relief, Neutral state)

Feature type | # of cues among the 25 best
F0-related | 4
Energy | 5
Microprosody | 4
Formants | 2
Duration from phonetic alignment | 4
Other cues from transcription | 6

• The media channel (phone/microphone), the type of data (adult vs. children, realistic vs. naturalistic) and the emotion classes all have an impact on the most relevant set of features.
• Out of our 5 classes, Sadness is the least recognized without mixing the cues.
• Features from all the classes were selected (different ones from one class to another). A sketch of such an attribute-selection step follows below.
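A hedged sketch of the attribute-selection step: the deck used WEKA's attribute selection, so scikit-learn's SelectKBest with an ANOVA F-test is only an assumed substitute, run here on synthetic data of the same shape (~200 cues, 5 classes).

# Hedged sketch: rank the ~200 cues and keep the 25 best. SelectKBest with an
# ANOVA F-test is an assumed substitute for WEKA's attribute selection, and
# the feature matrix and labels below are synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))        # 300 emotional units x ~200 cues
y = rng.integers(0, 5, size=300)       # 5 classes (Anger, Fear, Sadness, Relief, Neutral)

selector = SelectKBest(score_func=f_classif, k=25).fit(X, y)
print("Indices of the 25 selected cues:", np.flatnonzero(selector.get_support()))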

Page 14

Real-life emotional system

Systems based on acted data are inadequate for real-life data detection (Batliner).

GEMEP/CEMO comparison: different emotions; first experiments show an acceptable detection score only for Anger.

Real-life emotion studies are necessary.

Detection results on call-center data (state of the art for "realistic emotions"): > 80% for 2 emotions, > 60% for 4 emotions, ~55% for 5 emotions.

Page 15

Challenges ahead

Short term:
• Acceptable solutions for targeted applications are in reach.
• Use a dynamic model of emotion for real-time emotion detection (history memory).
• New features: automatically extracted information on voice quality, affect bursts and disfluencies from the signal, which does not require exact speech recognition.
• Detect relaxed/tense voice (Scherer).
• Add contextual knowledge to the blind statistical model: social role, type of action, regulation (adapting emotional expression to strategic interaction goals; cf. face theory, Goffman).

Long term:
• An emotion-dynamics process based on appraisal models.
• Combining information at several levels: acoustic/linguistic and multimodal cues, adding contextual information (social role).

Page 16

Demo (coffee break...)

Page 17

Thanks