speech is more than only its lingvistic content

Page 1: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

WIKT 2006

SPEECH IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Rusko Milan

Institute of Informatics of the Slovak Academy of Sciences
Dubravska cesta 9, 847 05 Bratislava, Slovakia

[email protected]

Page 2: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Expressive speech

“Expressive speech” designates the whole vocal display of a speaker.

It consists of:

• Linguistic information: the part of the information that can be encoded in a general written text message
• Various additional information on the speaker: age, cultural background, education, sex, attempt, relation to the listener, individuality, etc. (The expression “individuality” is used here to denote the personality, mood (attitude) and emotions of a speaker.)

Page 3: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Expressive speech

[Diagram: SPEAKER (transmitter) => CODING => SPEECH => DECODING => LISTENER (receiver); Expression => Impression]

Information component | Speaker (transmitter) | Listener (receiver)
Linguistic information | L1 | L2
Age | A1 | A2
Cultural background | C1 | C2
Education | E1 | E2
Sex | S1 | S2
Attempt | AT1 | AT2
Relation to the listener | R1 | R2
Individuality (personality, mood, emotions) | I1 | I2
Other | X1 ... Xi | Y1 ... Yk

Page 4: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Personality (and temperament)

Personality is considered to be a set of constant features of an individual.

Temperament is that aspect of personality that is genetically based, inborn.

Ancient Greeks: 2 dimensions of temperament => 4 types of temperament:

• sanguine type (cheerful and optimistic, pleasant to be with)

• choleric type (quick, hot temper, often an aggressive nature)

• phlegmatic type (characterized by slowness, laziness, and dullness)

• melancholy type (sad, even depressed, pessimistic view of world)

Page 5: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Generalized model of personality

A personality p has n dimensions, each represented by a value in the interval [0,1], so it can be represented by the following vector (Egges, A., Kshirsagar, S., Magnenat-Thalmann, N. [2]):

$$p^{T} = [\alpha_1, \ldots, \alpha_n], \quad \forall i \in [1, n] : \alpha_i \in [0, 1] \qquad (1)$$

Page 6: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

The OCEAN model („The Big Five“ model of personality)

In this model, five dimensions are considered sufficient to describe personality.

The Big Five model, also known as the OCEAN model, takes into account the following five dimensions of personality:

• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism

(Digman, J. M. [3]; McCrae, R. R., John, O. P. [4])

Page 7: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Traditional psychological classification of personality dimensions

Five Factor Model [Digman 1990; McCrae, John 1992]

Personality dimension | Our code [values] | Description | High level [1] (example adjectives) | Low level [-1] (example adjectives)
Neuroticism | N [1,0,-1] | Tendency to experience negative thoughts | Sensitive, nervous, insecure, emotionally distressed | Secure, confident
Extraversion | E [1,0,-1] | Preference for and behaviour in social situations | Outgoing, energetic, talkative, social | Shy, withdrawn
Openness to experience | O [1,0,-1] | Open-mindedness, interest in culture | Inventive, curious, imaginative, creative, explorative | Cautious, conservative
Agreeableness | A [1,0,-1] | Interactions with others | Friendly, compassionate, trusting, cooperative | Competitive
Conscientiousness | C [1,0,-1] | Organized, persistent in achieving goals | Efficient, methodical, well organized, dutiful | Easy-going, careless

Page 8: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Mood and Emotion

Mood (attitude) can be defined as a rather static state of being that is less static than personality and less fluent than emotions. Mood can be defined as one-dimensional (e.g. good or bad mood) or perhaps multi-dimensional (feeling in love, being paranoid, etc.).

(Kshirsagar & Magnenat-Thalmann [5])

Page 9: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Generalized model of emotion

An emotional state has a structure similar to that of personality, but it changes over time.

It is defined as an m-dimensional vector, where each of the m emotion intensities is represented by a value in the interval [0,1]:

$$e_t^{T} = \begin{cases} [\beta_1, \ldots, \beta_m], \ \forall i \in [1, m] : \beta_i \in [0, 1] & \text{if } t > 0 \\ 0 & \text{if } t = 0 \end{cases} \qquad (2)$$

The current emotional state depends on the preceding evolution of the emotions.

Emotions therefore need to be modelled with respect to their previous trends (history).

An emotional state history ωt is defined that contains all emotional states up to et, thus:

$$\omega_t = \langle e_0, e_1, \ldots, e_t \rangle \qquad (3)$$

Page 10: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Generalized model of mood

Egges continues by defining the individual IT as a triple (p, mt, et), where mt represents the mood of the individual at a time t.

Each mood dimension is defined as a value in the interval [-1,1].

With k mood dimensions, the mood can be described as follows:

$$m_t^{T} = \begin{cases} [\gamma_1, \ldots, \gamma_k], \ \forall i \in [1, k] : \gamma_i \in [-1, 1] & \text{if } t > 0 \\ 0 & \text{if } t = 0 \end{cases} \qquad (4)$$

The mood and emotional values change over time

=> Both have to be updated regularly.
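The definitions in equations (1) to (4) can be sketched as a small data structure. This is a minimal illustration only, not the implementation of Egges et al.: the class name `Individual`, its field names and the `update` method are hypothetical, and the way new mood and emotion values are computed is deliberately left to the caller.

```python
from dataclasses import dataclass, field
from typing import List


def _check(values: List[float], low: float, high: float, name: str) -> None:
    """Raise ValueError if any component lies outside [low, high]."""
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"{name} component {v} outside [{low}, {high}]")


@dataclass
class Individual:
    """Hypothetical container for the triple (p, m_t, e_t) and the history."""
    personality: List[float]   # p: n values in [0, 1], constant over time
    n_moods: int               # k mood dimensions
    n_emotions: int            # m emotion dimensions
    mood: List[float] = field(init=False)
    emotion: List[float] = field(init=False)
    history: List[List[float]] = field(init=False)

    def __post_init__(self) -> None:
        _check(self.personality, 0.0, 1.0, "personality")
        self.mood = [0.0] * self.n_moods         # m_0 = 0
        self.emotion = [0.0] * self.n_emotions   # e_0 = 0
        self.history = [list(self.emotion)]      # omega_0 = <e_0>

    def update(self, mood: List[float], emotion: List[float]) -> None:
        """Store the state for the next time step and extend the history."""
        if len(mood) != self.n_moods or len(emotion) != self.n_emotions:
            raise ValueError("wrong number of mood or emotion dimensions")
        _check(mood, -1.0, 1.0, "mood")          # mood values lie in [-1, 1]
        _check(emotion, 0.0, 1.0, "emotion")     # emotion values lie in [0, 1]
        self.mood = list(mood)
        self.emotion = list(emotion)
        self.history.append(list(emotion))       # omega_t = <e_0, ..., e_t>


# Illustrative values only: five personality dimensions, one mood
# dimension and six emotion dimensions.
person = Individual(personality=[0.8, 0.5, 0.9, 0.6, 0.2], n_moods=1, n_emotions=6)
person.update(mood=[0.4], emotion=[0.0, 0.0, 0.0, 0.7, 0.0, 0.1])
```

Keeping the full history in a list mirrors the definition of ωt; a real system would probably bound its length.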

Page 11: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Basic emotions

Author | Basic emotions
Arnold | Anger, aversion, courage, dejection, desire, despair, fear, hate, hope, love, sadness
Ekman, Friesen, and Ellsworth | Anger, disgust, fear, joy, sadness, surprise
Frijda | Desire, happiness, interest, surprise, wonder, sorrow
Gray | Rage and terror, anxiety, joy
Izard | Anger, contempt, disgust, distress, fear, guilt, interest, joy, shame, surprise
James | Fear, grief, love, rage
McDougall | Anger, disgust, elation, fear, subjection, tender-emotion, wonder
Mowrer | Pain, pleasure
Oatley and Johnson-Laird | Anger, disgust, anxiety, happiness, sadness
Panksepp | Expectancy, fear, rage, panic
Plutchik | Acceptance, anger, anticipation, disgust, joy, fear, sadness, surprise
Tomkins | Anger, interest, contempt, disgust, distress, fear, joy, shame, surprise
Watson | Fear, love, rage
Weiner and Graham | Happiness, sadness

There are many theories of emotions and many different classifications exist.

This table, taken from Ortony, A., Turner, T. J. [6], gives a short overview of the basic emotion sets used by different authors.

Page 12: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Placement on emotion dimensions

Pleasure
Happy <======> Unhappy
Pleased <======> Annoyed
Satisfied <======> Unsatisfied
Contented <======> Melancholic
Hopeful <======> Despairing
Relaxed <======> Bored

Arousal
Stimulated <======> Relaxed
Excited <======> Calm
Frenzied <======> Sluggish
Jittery <======> Dull
Wide-awake <======> Sleepy
Aroused <======> Unaroused

Dominance
Controlling <======> Controlled
Influential <======> Influenced
In control <======> Cared-for
Important <======> Awed
Dominant <======> Submissive
Autonomous <======> Guided

Semantic differential scales are often used for measuring emotion dimensions.

A set of dimensions as proposed by Mehrabian & Russell (1974, Appendix B, p. 216) [7].

It is evident that the authors have included moods and personality dimensions in this system too.

Page 13: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Acoustic correlates of emotions

Problem: the speech parameters involved in the expression of personality, moods and emotions are shared by all the components of expressivity.

Decoding the expressive speech code is very subjective.

Nevertheless, a general set of speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion:

• Pitch contour
• Timing
• Voice quality

It is believed that value combinations of these speech parameters are used to express vocal emotion (Schröder, M. [8]).

Page 14: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Pitch contour

Pitch contour is a representation of the intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance.

Pitch is expressed as fundamental frequency (F0).

One of the most frequently used methods for F0 measurement uses the autocorrelation function of the LP (linear prediction) residual.

Parameters include average pitch, pitch range, contour slope, and final lowering.
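The F0 measurement mentioned above (autocorrelation of the LP residual) can be sketched for a single voiced frame as follows. This is a simplified illustration under stated assumptions, not the procedure used by the author: the function names, the LP order of 10, the Hamming window, the 60 to 400 Hz search range and the synthetic test signal are all choices made here for the example, and there is no voicing decision or smoothing across frames.

```python
import math


def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(frame, order):
    """LP coefficients a[0..order] (a[0] = 1) via the Levinson-Durbin recursion."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]          # Hamming window
    r = autocorrelation(windowed, order)
    r[0] *= 1.0 + 1e-9          # tiny regularisation against ill-conditioning
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient of this step
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(frame, a):
    """Inverse-filter the frame with A(z) = sum_j a[j] z^-j."""
    return [sum(a[j] * frame[i - j] for j in range(len(a)) if i - j >= 0)
            for i in range(len(frame))]


def estimate_f0(frame, fs, order=10, f0_min=60.0, f0_max=400.0):
    """F0 in Hz from the strongest autocorrelation peak of the LP residual."""
    residual = lp_residual(frame, lpc(frame, order))
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    r = autocorrelation(residual, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    return fs / best_lag


# Synthetic voiced frame: an impulse train with a period of 80 samples
# (100 Hz at 8 kHz) passed through a single two-pole resonator near 700 Hz.
fs, period, n = 8000, 80, 640
r_pole, theta = 0.95, 2 * math.pi * 700 / fs
frame, y1, y2 = [], 0.0, 0.0
for i in range(n):
    x = 1.0 if i % period == 0 else 0.0
    y = x + 2 * r_pole * math.cos(theta) * y1 - r_pole ** 2 * y2
    frame.append(y)
    y1, y2 = y, y1

f0 = estimate_f0(frame, fs)
```

Working on the residual rather than on the speech signal itself removes most of the formant structure, so the autocorrelation peak reflects the excitation period rather than a vocal tract resonance.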

Parameter | Anger | Happiness | Sadness | Fear
Speech rate | Faster | Slightly faster | Slightly slower | Much faster
Pitch average | Very much higher | Much higher | Slightly lower | Very much higher
Pitch range | Much wider | Much wider | Slightly narrower | Much wider
Intensity | Higher | Higher | Lower | Higher
Pitch changes | Abrupt, downward-directed contours | Smooth, upward inflections | Downward inflections | Downward terminal inflections
Voice quality | Breathy, chesty tone | Breathy, blaring | Resonant | Irregular voicing
Articulation | Clipped | Slightly slurred | Slurred | Precise

Page 15: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Intonation contour

Models of intonation fall into two main categories:

• Phonetic
• Phonological

The phonetic models (e.g. the Fujisaki model, the Tilt model, MOMEL and many others) model the intonation curve. The phonological models (e.g. ToBI) are used to model the speaker's concept of the distribution of accents in the intonational phrase.

Page 16: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Automatic intonation contour analysis in Fujisaki editor

Page 17: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Pitch contour analysis in PRAAT with ToBI labels

Page 18: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Timing

• Speed at which an utterance is spoken
• Rhythm
• Duration of emphasized syllables

The results of measurements of syllable and phoneme lengths are often given in the form of z-scores (the instantaneous value is normalized by the mean value and the standard deviation of the same elements in the whole database).

Parameters: speech rate, hesitation pauses, exaggeration...
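As a worked illustration of the z-score normalization of durations, the sketch below subtracts the mean duration of the same element type and divides by its standard deviation. The function name and the sample durations are invented for the example, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption.

```python
from statistics import mean, pstdev


def duration_z_scores(durations_ms):
    """z-scores of durations measured for one element type (e.g. one phoneme)."""
    mu = mean(durations_ms)
    sigma = pstdev(durations_ms)
    if sigma == 0:
        # All durations are identical: no deviation from the mean.
        return [0.0] * len(durations_ms)
    return [(d - mu) / sigma for d in durations_ms]


# Hypothetical durations (in ms) of the same vowel across a small database.
z = duration_z_scores([80.0, 100.0, 120.0])
```

A positive z-score marks a realization that is longer than is typical for that element, which makes lengthening comparable across phonemes with very different intrinsic durations.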

Page 19: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Voice quality

Voice quality denotes the overall ‘character’ of the voice, which includes effects such as whispering, hoarseness, breathiness, and intensity.

The voice quality is influenced mainly by:

• the function of the glottis
• the function of the vocal tract

A detailed classification scheme was published by Laver [9].

Page 20: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Supralaryngeal settings (code)

Longitudinal axis:
• Labial: Labial protrusion (LP), Labiodentalization (LD)
• Laryngeal: Raised larynx (RL), Lowered larynx (LL)

Latitudinal axis settings:
• Labial: Close rounding (CR), Lip-spreading (LS)
• Lingual tip-blade: Tip articulation (TA), Blade articulation (BA), Retroflex articulation (RA)
• Tongue-body: Dentalized (DT), Palato-alveolarized (PA), Palatalized (P), Velarized (V), Pharyngealized (PH), Laryngopharyngealized (LPH)
• Mandibular: Close jaw position (CJP), Open jaw position (OJP), Protruded jaw position (PJP), Retracted jaw position (RJP)

Velopharyngeal settings: Nasal (N), Denasal (DN)

Laryngeal settings (code)

Simple phonation types: Modal voice (MV), Falsetto (F), Whisper (W), Creak (C)

Compound phonation types: Whispery voice (WV), Whispery falsetto (WF), Creaky voice (CV), Creaky falsetto (CF), Whispery creak (WC), Whispery creaky voice (WCV), Whispery creaky falsetto (WCF), Breathy voice (BV), Harsh voice (HV), Harsh falsetto (HF), Harsh whispery voice (HWV), Harsh whispery falsetto (HWF), Harsh creaky voice (HCV), Harsh creaky falsetto (HCF), Harsh whispery creaky voice (HWCV), Harsh whispery creaky falsetto (HWCF)

Overall muscular tension settings: Tense voice (TV), Lax voice (LV)

Page 21: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Analysis of the glottal function

The analysis of the glottal function is generally done using the source-filter model of speech production [10].

The glottal function is obtained from the speech signal by inverse filtering. One of the most efficient inverse filtering methods uses Discrete Linear Prediction (DLP) (El-Jaroudi A., Makhoul J. [11]) to obtain the inverse filter coefficients and to filter the speech signal.

The resulting DLP residual is taken to represent the derivative of the glottal volume velocity function.

Page 22: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Time and spectral domain characteristics of the glottal function

Time characteristics

• OQ, Open Quotient: ratio of the open phase of the glottal waveform to the period of the pulse. OQ predicts the amplitudes of the lower harmonics (an increased value of OQ is correlated with an increase in the amplitude of the lower harmonics in the voice spectrum).
• CQ, Closing Quotient: ratio of the closing phase of the glottal pulse to the period of the pulse. These characteristics have recently often been replaced by AQ (Amplitude Quotient) and NAQ (Normalized Amplitude Quotient) (Alku [12]).
• EE, Excitation Strength: amplitude of the negative peak, calculated after the positive peak. EE is correlated with the overall intensity of the signal. A decrease in EE is correlated with a breathy voice.
• RK, Glottal Symmetry/Skew: ratio of the closing phase to the opening phase of the differentiated glottal pulse. RK affects mainly the lower harmonics; the more symmetrical the pulse, the greater their amplitude.

Spectral characteristics

• H1-H2: the amplitude of the first harmonic (H1) compared to the amplitude of the second harmonic (H2). An indicator of the relative length of the opening phase of the glottal pulse (Hanson 1997).
• H1-A1: the amplitude of the first harmonic (H1) compared to the strongest harmonic in the first formant (A1). Reflects the first formant bandwidth.
• Spectral tilt: expected to be large and positive for breathy voices and small and/or negative for creaky voices.
• H1-A2: the amplitude of the first harmonic (H1) compared to the amplitude of the strongest harmonic in the second formant (A2). An indicator of spectral tilt at the mid formant frequencies. Large and positive for breathy voices and small and/or negative for creaky voices.
• H1-A3: the amplitude of the first harmonic (H1) compared to the amplitude of the strongest harmonic in the third formant (A3). An indicator of spectral tilt at the higher formant frequencies. Large and positive for breathy voices and small and/or negative for creaky voices.
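The quotients and the H1-H2 measure defined above reduce to simple ratios once the relevant instants and harmonic amplitudes are known. The sketch below is illustrative only. It assumes that the opening, peak and closing instants of one glottal cycle are already available (how they are detected is not covered here), that the quotients are computed from phase durations, that H1-H2 is expressed as a level difference in dB, and that the analysed segment spans a whole number of pitch periods. All function names are hypothetical.

```python
import math


def glottal_quotients(t_open, t_peak, t_close, period):
    """OQ, CQ and RK from the instants of one glottal cycle (same time unit)."""
    opening = t_peak - t_open      # opening phase duration
    closing = t_close - t_peak     # closing phase duration
    return {
        "OQ": (opening + closing) / period,   # open phase / period
        "CQ": closing / period,               # closing phase / period
        "RK": closing / opening,              # closing phase / opening phase
    }


def harmonic_amplitude(x, fs, freq):
    """Amplitude of the sinusoidal component of x at freq (single-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / fs) for n, s in enumerate(x))
    im = sum(s * math.sin(2 * math.pi * freq * n / fs) for n, s in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


def h1_minus_h2_db(x, fs, f0):
    """Level difference in dB between the first and the second harmonic."""
    h1 = harmonic_amplitude(x, fs, f0)
    h2 = harmonic_amplitude(x, fs, 2 * f0)
    return 20.0 * math.log10(h1 / h2)


# Illustrative cycle: opens at 0 ms, peaks at 3 ms, closes at 5 ms, period 10 ms.
q = glottal_quotients(0.0, 3.0, 5.0, 10.0)

# Illustrative signal: a 100 Hz harmonic of amplitude 1.0 plus a 200 Hz
# harmonic of amplitude 0.5, ten full periods at fs = 8 kHz.
fs, f0 = 8000, 100.0
x = [math.cos(2 * math.pi * f0 * n / fs) + 0.5 * math.cos(2 * math.pi * 2 * f0 * n / fs)
     for n in range(800)]
tilt = h1_minus_h2_db(x, fs, f0)
```

For this test signal the amplitude ratio is 2, so the level difference is 20·log10(2), about 6 dB.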

Page 23: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Glottal pulse analysis in APARAT

Page 24: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Analysis of the vocal tract

Methods of vocal tract shape estimation include X-ray, computed tomography and magnetic resonance methods.

- stationary sound production only

A cheaper and quicker method is to compute the vocal tract shape from the speech signal (e.g. vocal tract shape computation from LPC-derived reflection coefficients), complementary to glottal pulse analysis from the speech signal.

- allows for analysis of the dynamic behaviour of the articulators

Similar information can be obtained by formant analysis using homomorphic deconvolution (cepstrum) or LPC spectrum analysis.
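The step from reflection coefficients to a vocal tract shape can be illustrated with the lossless concatenated-tube model, in which each reflection coefficient fixes the ratio of two neighbouring cross-sectional areas. The sketch below assumes the convention k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i). Sign and direction conventions (glottis to lips or the reverse) differ between texts, so this is one possible choice rather than the method used on the slide, and only relative areas are recovered.

```python
def areas_from_reflection(k, first_area=1.0):
    """Relative tube areas from reflection coefficients.

    Assumed convention: k[i] = (A[i+1] - A[i]) / (A[i+1] + A[i]),
    hence A[i+1] = A[i] * (1 + k[i]) / (1 - k[i]).
    """
    areas = [first_area]
    for ki in k:
        if not -1.0 < ki < 1.0:
            raise ValueError("reflection coefficients must lie in (-1, 1)")
        areas.append(areas[-1] * (1.0 + ki) / (1.0 - ki))
    return areas


def reflection_from_areas(areas):
    """Inverse mapping under the same convention."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


# Illustrative three-section tube with relative areas 1 : 3 : 1.5.
k = reflection_from_areas([1.0, 3.0, 1.5])
shape = areas_from_reflection(k)
```

Because the mapping only needs one frame of coefficients, it can be repeated frame by frame, which is what makes the dynamic behaviour of the articulators accessible.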

Page 25: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Static analysis by synthesis using articulatory synthesizer (TRACTSYN)

Page 26: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Dynamic analysis by synthesis (articulatory synth. TRACTSYN)

Page 27: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Acoustic correlates of emotions applied in speech synthesis

Joy
Study: Burkhardt & Sendlmeier (2000)
Language: German
Recognition rate (chance level): 81% (1/9)
Parameter settings:
• F0 mean: +50%
• F0 range: +100%
• Tempo: +30%
• Voice quality: modal or tense; “lip-spreading feature”: F1 / F2 +10%
• Other: “wave pitch contour model”: main stressed syllables are raised (+100%), syllables in between are lowered (-20%)

Sadness
Study: Cahn (1990)
Language: American English
Recognition rate (chance level): 91% (1/6)
Parameter settings:
• F0 mean: “0”, reference line “-1”, less final lowering “-5”
• F0 range: “-5”, steeper accent shape “+6”
• Tempo: “-10”, more fluent pauses “+5”, hesitation pauses “+10”
• Loudness: “-5”
• Voice quality: breathiness “+10”, brilliance “-9”
• Other: stress frequency “+1”, precision of articulation “-5”

Anger
Study: Murray & Arnott (1995)
Language: British English
Parameter settings:
• F0 mean: +10 Hz
• F0 range: +9 s.t.
• Tempo: +30 wpm
• Loudness: +6 dB
• Voice quality: laryngealisation +78%; F4 frequency -175 Hz
• Other: increase pitch of stressed vowels (2ary: +10% of pitch range; 1ary: +20%; emphatic: +40%)

Fear
Study: Burkhardt & Sendlmeier (2000)
Language: German
Recognition rate (chance level): 52% (1/9)
Parameter settings:
• F0 mean: “+150%”
• F0 range: “+20%”
• Tempo: “+30%”
• Voice quality: falsetto

Surprise
Study: Cahn (1990)
Language: American English
Recognition rate (chance level): 44% (1/6)
Parameter settings:
• F0 mean: “0”, reference line “-8”
• F0 range: “+8”, steeply rising contour slope “+10”, steeper accent shape “+5”
• Tempo: “+4”, less fluent pauses “-5”, hesitation pauses “-10”
• Loudness: “+5”
• Voice quality: brilliance “-3”

Boredom
Study: Mozziconacci (1998)
Language: Dutch
Recognition rate (chance level): 94% (1/7)
Parameter settings:
• F0 mean: end frequency 65 Hz (male speech)
• F0 range: excursion size 4 s.t.
• Tempo: duration rel. to neutrality: 150%
• Other: final intonation pattern 3C, avoid final patterns 5&A and 12
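To make the relative settings concrete, the sketch below applies the percentage changes reported for joy and fear (Burkhardt & Sendlmeier, 2000) to a neutral prosody description. The `Prosody` class, the neutral values and the reading of "Tempo +30%" as a 30% higher speech rate are assumptions made for this example; the studies above define their parameters within their own synthesizers, and the voice quality settings are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    f0_mean_hz: float         # mean fundamental frequency
    f0_range_hz: float        # width of the F0 range
    tempo_syll_per_s: float   # speech rate


def apply_relative(neutral, f0_mean_pct=0.0, f0_range_pct=0.0, tempo_pct=0.0):
    """Return a new Prosody with percentage changes applied to the neutral one."""
    return Prosody(
        f0_mean_hz=neutral.f0_mean_hz * (1.0 + f0_mean_pct / 100.0),
        f0_range_hz=neutral.f0_range_hz * (1.0 + f0_range_pct / 100.0),
        tempo_syll_per_s=neutral.tempo_syll_per_s * (1.0 + tempo_pct / 100.0),
    )


# Percentage settings copied from the table; the dictionary layout is ours.
JOY = {"f0_mean_pct": 50.0, "f0_range_pct": 100.0, "tempo_pct": 30.0}
FEAR = {"f0_mean_pct": 150.0, "f0_range_pct": 20.0, "tempo_pct": 30.0}

# Hypothetical neutral male voice.
neutral = Prosody(f0_mean_hz=120.0, f0_range_hz=40.0, tempo_syll_per_s=5.0)
joy = apply_relative(neutral, **JOY)
fear = apply_relative(neutral, **FEAR)
```

The settings of Cahn (1990) use abstract scale values rather than percentages, so they cannot be applied with this helper without a mapping defined by that system.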

Page 28: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Vision: Speech Sound Mining

Aim: to extract information from supra-segmental and extra-linguistic layers

Where to look for information:

• time domain
  a) quantity (lengths of segments)
  b) rhythm
• frequency domain
  a) long term characteristics
  b) short term characteristics
• model based characteristics
  a) glottal excitation function
  b) articulatory model

Page 29: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Vision: Speech Sound Mining

How to define a set of speech sound objects?

Objective methods of analysis (pattern recognition)

Subjective methods (impression of the listener)

Possible objects:

Speech sound event

Speech sound act

Speech sound gesture

Speech sound characteristic

Speech sound characteristic change

Page 30: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Vision: Speech Sound Mining

First steps to be accomplished:

• Speech corpus building
• Annotation of SSO (speech sound objects)
• Boundary markers
• Frequencies of occurrence of SSO
• Concordances of SSO
• Correlation among different sets of objects (pitch SSO, accent SSO, rhythmic SSO, timbre SSO, etc.)
• Semantic representation of SSO
• Cross-cultural semantic analysis

Page 31: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Vision: Speech Sound Mining

Traditional methods used in NLP and data mining will be applicable:

• Bag of words => Bag of SSO
• WordNet => SSO semantic net
• etc.

Research on the relation between linguistic and paralinguistic & extralinguistic information.

Creation of a complex (holistic) model of the speech signal as an information carrier in communication.
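The "bag of SSO" idea can be sketched by direct analogy with a bag of words: an annotated utterance becomes an unordered multiset of speech sound object labels, from which frequencies of occurrence follow immediately. The label names below are invented for illustration, since the slides do not define an SSO inventory, and the function names are hypothetical.

```python
from collections import Counter


def bag_of_sso(labels):
    """Count the occurrences of each SSO label, ignoring their order."""
    return Counter(labels)


def relative_frequencies(bag):
    """Relative frequency of each SSO label in the bag."""
    total = sum(bag.values())
    return {label: count / total for label, count in bag.items()}


# Hypothetical annotation of one utterance.
utterance = ["pitch_rise", "pause", "pitch_rise", "creaky_onset"]
bag = bag_of_sso(utterance)
freq = relative_frequencies(bag)
```

As with bags of words, the representation discards order, so concordances and correlations between object sets would need the time-aligned annotation as well.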

Page 32: SPEECH  IS MORE THAN ONLY ITS LINGVISTIC CONTENT

Thank you for your attention

Milan Rusko

Institute of Informatics, Slovak Academy of Sciences

[email protected]