Emotional Speech detection
Laurence Devillers, LIMSI-CNRS, devil@limsi.fr


TRANSCRIPT

Page 1

Emotional Speech detection
Laurence Devillers, LIMSI-CNRS, devil@limsi.fr

Expression of emotions in speech synthesis
Marc Schröder, DFKI, [email protected]

Humaine Plenary Meeting, 4-6 June 2007, Paris

Page 2

Overview

Challenge: a real-time system for "real-life" emotional speech detection, in order to build an affectively competent agent.

• Emotion is considered in the broad sense.
• Real-life emotions are often shaded, blended or masked emotions, due to social aspects.

Page 3

State of the art

• Static emotion detection systems (emotional unit level: word, chunk, sentence)
• Statistical approaches (such as SVM) using large amounts of data to train models
• 4-6 emotions detected, rarely more

[Diagram: components of an automatic emotion recognition system: feature extraction from the observation O, followed by emotion detection, i.e. choosing among the emotion models Ei the one that maximizes P(Ei | O).]

Performance on realistic data (CEICES): 2 emotions > 80%, 4 emotions > 60%.
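As a minimal illustration of the decision step in this scheme, the Python sketch below picks the emotion Ei with the highest posterior P(Ei | O) returned by a trained classifier. The class names and scores are invented for illustration and are not taken from the deck.

# Minimal sketch (illustrative, not from the deck): the decision step of the
# scheme above, choosing the emotion Ei that maximizes P(Ei | O).

def detect_emotion(posteriors):
    """Return the emotion class with the highest posterior P(Ei | O)."""
    return max(posteriors, key=posteriors.get)

# Posteriors such as an SVM with probability estimates might return for one
# emotional unit (word, chunk or sentence):
scores = {"Anger": 0.55, "Fear": 0.25, "Neutral": 0.15, "Relief": 0.05}
print(detect_emotion(scores))  # -> Anger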

Page 4

Automatic emotion detection

The difficulty of the detection task increases with the variability of the emotional speech expression.

4 dimensions:
• Speaker (dependent/independent, age, gender, health)
• Environment (transmission channel, noise environment)
• Number and type of emotions (primary, secondary)
• Acted/real-life data and application context

Page 5

Automatic emotion detection: research evolution

[Timeline figure, 1996 → 2003 → 2007, tracking progress along four dimensions:]
• Speakers: from speaker-dependent and pluri-speaker systems towards speaker-independent systems with adaptation (e.g. to gender), and eventually to personality, health, age and culture.
• Emotion representation: from primary acted emotions, Emotion/Non-emotion (WoZ) and Positive/Negative emotions (HMI), towards 2-5 realistic emotions (children, CEICES), real-life call-center emotions, >4 acted emotions, >5 real emotions and emotion in interaction.
• Acted/WoZ/real-life data: from actors, documentaries, news, fiction and TV clips towards WoZ, HMI and call-center data.
• Environment/transmission: from quiet rooms and channel-dependent conditions towards phone speech, public places, channel-independent conditions and superposed voices.

Page 6

Challenge with spontaneous emotions

• Authenticity is present, but there is no control on the emotion
• Need to find appropriate labels and measures for annotation validation
• Blended emotions (Scherer: Geneva Airport Lost Luggage Study)

Annotation and validation of the annotation
• Expert annotation phase by several coders (10 coders; CEICES: 5 coders; often only two)
• Control of the quality of the annotations:
  • Intra-/inter-annotator agreement (see the sketch below)
  • Perception tests
• Validate the annotation scheme and the annotations: perception of emotion mixtures (40 subjects), NEG/POS valence, importance of the context
• Give a measure for comparing human perception with automatic detection
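One common way to quantify inter-coder agreement is Cohen's kappa; the deck does not name a specific measure, so this choice is an assumption, as are the scikit-learn tool and the invented label sequences in the sketch below.

# Hedged sketch: Cohen's kappa as an inter-coder agreement measure.
# scikit-learn is an assumed tool and the two label sequences are invented.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["Anger", "Anger", "Fear", "Neutral", "Relief", "Anger"]
coder_2 = ["Anger", "Fear", "Fear", "Neutral", "Neutral", "Anger"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa between the two coders: {kappa:.2f}")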

Page 7

Human-Human real-life corpora (LIMSI)

Corpus | Type | Audio/Audio-visual | Size | #Speakers | Emotion classes
Stock exchange | Call center, French | Audio | 4.5h | 100 callers / 4 agents | Anger, Fear, Satisfaction, Excuse
Financial loan | Call center, French | Audio | 2h | 250 callers / 2 agents | Anger, Fear, Satisfaction, Excuse
Medical | Call center, French | Audio | 20h | 784 callers / 7 agents | 20 classes, 7 broad classes
EmoTaboo | Emotion induction game, French | Audio-visual | 7h30 | 10 speakers | > 20 classes
EmoTV | TV news, French | Audio-visual | < 1h | 100 speakers | 14-35 classes, 7 macro-classes
SAFE (actors) | Movies | Audio-visual | 7h | 400 speakers | Fear, other Negative, Positive

Page 8

Context-dependent emotion labels

Do the labels represent the emotion of a considered task or context?

Example: real-life emotion studies (call center). The Fear label covers different expressions of Fear arising from different contexts: callers' fear of losing money, callers' fear for their life, agents' fear of making a mistake.

The difference is not just a question of intensity/activation:
-> Primary/secondary fear?
-> Degree of urgency/reality of the threat?

Fear in fiction (movies): study of many different contexts.

How to generalize? Should we define labels as a function of the type of context? We just defined the social role (agent/caller) as a context.

See poster of C. Clavel.

Page 9

Emotional labels

• The majority of detection systems use a discrete emotion representation.
• This requires a sufficient amount of data per class; to that end, we use a hierarchical organization of labels (LIMSI example). A code sketch of the mapping follows the table below.

Coarse level (8 classes) | Fine-grained level (20 classes + Neutral)
Fear | Fear, Anxiety, Stress, Panic, Embarrassment
Anger | Annoyance, Impatience, ColdAnger, HotAnger
Sadness | Sadness, Dismay, Disappointment, Resignation, Despair
Hurt | Hurt
Surprise | Surprise
Relief | Relief
Interest | Interest, Compassion
Other Positive | Amusement
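A minimal sketch of this hierarchy as a lookup table in Python; the label spellings follow the table above, and only the representation (a plain dictionary) is an assumption.

# Sketch of the hierarchical label organization above: each of the 20
# fine-grained labels (plus Neutral) collapses onto one of the 8 coarse
# classes, so that sparse fine-grained classes can be pooled when needed.
FINE_TO_COARSE = {
    "Fear": "Fear", "Anxiety": "Fear", "Stress": "Fear",
    "Panic": "Fear", "Embarrassment": "Fear",
    "Annoyance": "Anger", "Impatience": "Anger",
    "ColdAnger": "Anger", "HotAnger": "Anger",
    "Sadness": "Sadness", "Dismay": "Sadness", "Disappointment": "Sadness",
    "Resignation": "Sadness", "Despair": "Sadness",
    "Hurt": "Hurt",
    "Surprise": "Surprise",
    "Relief": "Relief",
    "Interest": "Interest", "Compassion": "Interest",
    "Amusement": "Other Positive",
    "Neutral": "Neutral",
}

def coarse_label(fine_label):
    """Map a fine-grained annotation onto the coarse class level."""
    return FINE_TO_COARSE[fine_label]

print(coarse_label("Impatience"))  # -> Anger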

Page 10

No bad coders, but different perceptions

Combining the annotations of different coders into a soft vector of emotions:

Labeler 1: (Major) Annoyance, (Minor) Interest
Labeler 2: (Major) Stress, (Minor) Annoyance

Soft vector: ((wM+wm)/W Annoyance, wM/W Stress, wm/W Interest)

For wM = 2, wm = 1, W = 6: (0.5 Annoyance, 0.33 Stress, 0.17 Interest)
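A short worked sketch of this combination in Python, reproducing the numbers above; the list/dictionary representation is an assumption for illustration.

# Worked sketch of the soft-vector combination above: each coder's Major
# label receives weight wM and each Minor label weight wm; the vector is
# normalized by W, the total weight contributed by all coders.
from collections import defaultdict

W_MAJOR, W_MINOR = 2, 1

annotations = [                      # (Major, Minor) per coder, as on the slide
    ("Annoyance", "Interest"),       # Labeler 1
    ("Stress", "Annoyance"),         # Labeler 2
]

weights = defaultdict(float)
for major, minor in annotations:
    weights[major] += W_MAJOR
    weights[minor] += W_MINOR

W = sum(weights.values())            # W = 6 for two coders with wM=2, wm=1
soft_vector = {label: round(w / W, 2) for label, w in weights.items()}
print(soft_vector)                   # {'Annoyance': 0.5, 'Interest': 0.17, 'Stress': 0.33}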

Page 11

Speech data processing

[Pipeline diagram: the speech signal is analysed with Praat to extract ~200 cues (prosodic: F0, formants, energy; micro-prosody: jitter, shimmer, ...; disfluencies; affect bursts), while the transcription (words) goes through preprocessing (stemming) into n-gram models; the two streams are combined and fed to the WEKA toolkit (www.cs.waikato.ac.nz; Witten & Frank, 1999) for attribute selection and classification (SVM, ...).]

LIMSI (see poster of L. Vidrascu):
• Standard features: pitch level and range, energy level and range, speaking rate, spectral features (formants, MFCCs)
• Less standard features: voice quality (local disturbances: jitter/shimmer), disfluencies (pauses, filled pauses), affect bursts

We need to automatically detect affect bursts and to add new features such as voice quality features. The phone signal is not of sufficient quality for many existing techniques (see Ni Chasaide poster). A sketch of such a cue-extraction and classification pipeline follows below.
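The deck's own pipeline relies on Praat for cue extraction and on WEKA for attribute selection and SVM classification. The sketch below is only an assumed stand-in: it uses the parselmouth (Praat) and scikit-learn Python packages, extracts just a few of the "standard" cues (pitch and energy level and range), and the file names and labels are hypothetical placeholders.

# Hedged sketch of a cue-extraction + SVM pipeline. parselmouth (a Python
# wrapper around Praat) and scikit-learn stand in for the Praat + WEKA tools
# used in the deck; wav file names and labels are hypothetical placeholders.
import numpy as np
import parselmouth
from sklearn.svm import SVC

def prosodic_features(wav_path):
    """A few 'standard' cues: pitch level/range and energy level/range."""
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    energy = snd.to_intensity().values.flatten()
    return np.array([f0.mean(), f0.max() - f0.min(),
                     energy.mean(), energy.max() - energy.min()])

# One labeled wav file per emotional unit (placeholders):
train_files = ["unit_001.wav", "unit_002.wav", "unit_003.wav", "unit_004.wav"]
train_labels = ["Anger", "Neutral", "Fear", "Neutral"]

X = np.vstack([prosodic_features(f) for f in train_files])
clf = SVC(kernel="rbf").fit(X, train_labels)

# Predicted emotion class for a new unit:
print(clf.predict(prosodic_features("unit_005.wav").reshape(1, -1)))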

Page 12

[Bar chart: percentage of correct detection (roughly 50-90%) for different combinations of classes, from 2 to 5 emotions: Fe/N, Fe/Sd, Ag/N, Ax/St, Fe/Ag, Sd/N, Fe/Ag/N, Fe/Sd/N, Fe/Ag/Sd/Re, Fe/Ag/Sd/Re/N. Fe: fear, Sd: sadness, Ag: anger, Ax: anxiety, St: stress, Re: relief, N: neutral.]

LIMSI: results with paralinguistic cues (SVMs), from 2 to 5 emotion classes (% of correct detection).

Page 13

25 best features for 5-emotion detection (Anger, Fear, Sadness, Relief, Neutral state)

Feature type | # of cues among the 25 best
F0-related | 4
Energy | 5
Microprosody | 4
Formants | 2
Duration from phonetic alignment | 4
Other cues from transcription | 6

• The media channel (phone/microphone), the type of data (adult vs. children, realistic vs. naturalistic) and the emotion classes all have an impact on the most relevant set of features.
• Out of our 5 classes, Sadness is the least recognized without mixing the cues.
• Features from all the classes were selected (different ones from one class to another). A sketch of such an attribute-selection step follows below.
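A hedged sketch of the attribute-selection step: the deck used WEKA's attribute selection, so scikit-learn's SelectKBest with an ANOVA F-test is only an assumed substitute, run here on synthetic data of the same shape (~200 cues, 5 classes).

# Hedged sketch: rank the ~200 cues and keep the 25 best. SelectKBest with an
# ANOVA F-test is an assumed substitute for WEKA's attribute selection, and
# the feature matrix and labels below are synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))        # 300 emotional units x ~200 cues
y = rng.integers(0, 5, size=300)       # 5 classes (Anger, Fear, Sadness, Relief, Neutral)

selector = SelectKBest(score_func=f_classif, k=25).fit(X, y)
print("Indices of the 25 selected cues:", np.flatnonzero(selector.get_support()))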

Page 14

Real-life emotional system

Systems based on acted data are inadequate for real-life data detection (Batliner).

GEMEP/CEMO comparison: different emotions; first experiments show an acceptable detection score only for Anger.

Real-life emotion studies are necessary.

Detection results on call-center data (state of the art for "realistic emotions"): > 80% for 2 emotions, > 60% for 4 emotions, ~55% for 5 emotions.

Page 15

Challenges ahead

Short term:
• Acceptable solutions for targeted applications are in reach.
• Use a dynamic model of emotion for real-time emotion detection (history memory).
• New features: automatically extracted information on voice quality, affect bursts and disfluencies from the signal, which does not require exact speech recognition.
• Detect relaxed/tense voice (Scherer).
• Add contextual knowledge to the blind statistical model: social role, type of action, regulation (adapting emotional expression to strategic interaction goals; cf. face theory, Goffman).

Long term:
• An emotion-dynamics process based on appraisal models.
• Combining information at several levels: acoustic/linguistic and multimodal cues, adding contextual information (social role).

Page 16

Demo (coffee break...)

Page 17

Thanks