
Emotions and Speech: Some Acoustical Correlates

CARL E. WILLIAMS

Naval Aerospace Medical Research Laboratory, Pensacola, Florida 32512

KENNETH N. STEVENS

Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

(Received 14 March 1972)

This paper describes some further attempts to identify and measure those parameters in the speech signal that reflect the emotional state of a speaker. High-quality recordings were obtained of professional "method" actors reading the dialogue of a short scenario specifically written to contain various emotional situations. Excerpted portions of the recordings were subjected to both quantitative and qualitative analyses. A comparison was also made of recordings from a real-life situation, in which the emotions of a speaker were clearly defined, with recordings from an actor who simulated the same situation. Anger, fear, and sorrow situations tended to produce characteristic differences in contour of fundamental frequency, average speech spectrum, temporal characteristics, precision of articulation, and waveform regularity of successive glottal pulses. Attributes for a given emotional situation were not always consistent from one speaker to another. SUBJECT CLASSIFICATION: 9.5, 9.3.

INTRODUCTION

Long before the availability of modern instrumentation for the analysis of speech, researchers attempted to identify and measure those parameters in the speech signal that reflect the emotional state of a speaker. This paper describes one of a series of studies t-a undertaken in a further attempt to delineate some of the acoustic correlates of the emotions of a speaker.

There are two principal reasons why it is of interest to examine the parameters in the speech signal that are related to the emotional state of a speaker. (1) There are situations in which the physiological and emotional state of an individual needs to be monitored, and it would be convenient if an indication of this state could be obtained through analysis of the acoustic characteristics of his utterances. (2) Studies of speech attributes related to emotional state may help to contribute toward a general theory of speech performance. Such a theory should have two components: one that specifies the acoustic correlates of the linguistic units used for communication between speakers of a given language, and the other that describes the extralinguistic aspects of speech communication.

I. APPROACH

In planning the study, two approaches were considered: (1) a detailed analysis of "field" recordings where there would be no question as to the emotion present in the individual speaking, and (2) an analysis of high-quality recordings of professional actors simulating various emotions. The second approach was selected for the major portion of our work since it seemed to afford the best opportunity for obtaining good recordings that could be subjected to both quantitative and qualitative analyses. Because actors are presumably able to portray clear and unambiguous emotions, their utterances provide a means for exploring the basic manifestations of emotional speech. Field recordings often reflect the simultaneous presence of several emotions and the lack of control of the speech material. While an approach using actors is not novel, •-7 investigators employing it have never performed, to our knowledge, a spectrographic analysis of the recorded speech material.

1238 Volume 52 Number 4 (Part 2) 1972

Since we believed that the emotions of interest might best be described in terms of specific situations involving emotional interaction among several people, the decision was made to make use of a short play. Getting the actors involved in clearly defined situations would, hopefully, result in their experiencing and expressing the various emotions to be studied. The primary function of the play was to elicit the desired emotions from the actors and to serve as the carrier for selected phrases and sentences to be embedded in different emotional situations. These phrases and sentences, so-called "control clusters," could then later be subjected to detailed acoustical analyses. Because changes in the speech signal due to the presence of some emotions might be subtle, it was considered necessary to be able to compare utterances of identical material as they occurred in different emotional situations.

A detailed scene-by-scene outline of a short play involving three male characters was constructed. The outline was then given to a playwright who wrote the dialogue for the three characters who would be speaking in various situations. For each character, the playwright was instructed to include identical speech material (the control clusters) as part of the dialogue for that character speaking in different situations. In addition to analysis of the speech material in the control clusters, longer portions of the dialogue, usually several sentences surrounding a control cluster, were selected for study.

The services of a professional director and three professional actors were employed in recording the scenario. All four individuals were former members of the Actor's Studio in New York and had past experience with the so-called "method" style of acting. The recordings were made in a professional recording studio, with the actors rotating in each of the three roles.

In addition to the analysis of recordings of the short play, this study included a comparison of recordings from a real-life situation in which the emotions of the speaker were clearly defined, and recordings from an actor who simulated this situation. The purpose of this subsidiary investigation was to attempt to validate the use of actors to simulate emotions, as well as to obtain additional data on the acoustic correlates of various emotions.

II. SELECTION OF PARAMETERS FOR ANALYSIS

On the basis of the results of a preliminary study a and an examination of the literature on the physiological and acoustic correlates of emotion, only a few kinds of acoustic parameters were selected for detailed analysis in the present study. Before the data are presented, a brief review is given of the acoustic correlates upon which attention was focused, indicating reasons for selecting particular kinds of data.

Studies of the effects of emotion on the acoustic characteristics of speech have shown that average values and ranges of fundamental frequency (F0) differ from one emotion to another. • In the preliminary study, a the acoustic properties that appeared to be among the most sensitive indicators of emotion were attributes that specified the contour of F0 throughout an utterance. There are several reasons why changes in F0 with time are potentially capable of providing information concerning the emotional state of a speaker. First, considerable latitude is possible in the variations of F0, since only certain aspects of the F0 contour carry information with regard to the linguistic content of a message. The principal linguistic functions of F0 changes are to indicate stress, and to mark boundaries of different types of sentence-length or phrase-length units. Subject to these constraints, a speaker is relatively free to use changes in F0 to convey nonlinguistic information, such as his emotions, or to convey special emphasis of some kind. Furthermore, the fundamental frequency can undergo variations that may not be intended or be under overt control of the speaker, and hence may provide an indication of the speaker's emotional state.

Further justification for the preoccupation with F0 contours as indicators of emotions comes from a study of the literature on the physiological correlates of stress. It has been stated, for example, that "... respiration is frequently a sensitive indicator in certain emotional situations, especially startle, conscious attempts at deception, and conflict. The respiratory pattern is frequently disturbed in anxiety states."8 An increase in respiration rate would presumably result in an increased subglottal pressure during speech. This heightened subglottal pressure would give rise to a higher F0 during voiced sounds in speech. The increased respiration rate could also lead to shorter durations of speech between breaths, with a consequent effect on the basic temporal pattern of speech.

Other relevant physiological effects of certain emotions are dryness of the mouth often observed under conditions of emotional excitement, anticipation, fear, and anger, and tremor and disorganization of motor response, observed under conditions of emotional conflict. These effects can have an influence on various components of the speech system, including the larynx, which is directly involved in the control of F0. Muscle activity in the larynx and the condition of the vocal cords are likely to have a more direct influence on the sound output and, in particular, on the fundamental frequency, than changes in muscle activity in other parts of the speech generating system, such as the tongue, lips, and jaw. The reason is that the vibrating vocal cords have a direct effect on the volume velocity through the glottis, whereas the other muscles and vocal-tract components simply shape the resonant cavities for sound that is generated at the vocal cords. Thus, any analysis of the speech signal that reflects vocal-cord activity is more likely to be influenced by physiological changes brought about by the emotional state of the speaker.

Such physiological changes as increased subglottal pressure, excessive dryness or salivation, and decreased smoothness of motor control can have an influence on the waveform of the pulses from the vocal cords, as well as on their frequency. For example, increased subglottal pressure generally gives rise to a narrowing of individual glottal pulses, and hence to a change in the spectrum of the pulses. Under some circumstances, such as excessive salivation, there may be irregularities in the waveform of the glottal output from one pulse to the next. In the present study, a rough indication of glottal waveform was obtained from the average acoustic spectrum of an utterance, measured by means of an octave-band analyzer. Qualitative evidence of changes in glottal waveform and of irregularities in successive glottal pulses was obtained by observation of wide-band (300-Hz filter) spectrograms.

The Journal of the Acoustical Society of America 1239

FIG. 1. Fundamental frequency (F0) contours of selected control clusters as spoken by Voice B in different situations in the scenario. Numbers refer to control clusters. [Panels: NEUTRAL, ANGER, SORROW, FEAR; horizontal axis: TIME (SECONDS).]

Although primary attention in this study was focused on acoustic parameters related to laryngeal activity, some observations were also made on parameters that reflect control of the supralaryngeal structures. These included measurements of the rate of talking, analysis of acoustic attributes that provide an indication of the precision with which articulatory targets for certain vowels and consonants are achieved, and spectrographic observations of vowel formant frequencies and of intensity of release for stop consonants.

III. RECORDINGS OF SCENARIO

Although a vast amount of data was made available by the various recordings of the scenario, only selected samples will be presented in order to demonstrate some of the ways in which the data were examined. The data to be reported include both quantitative results, which were obtained by making various measurements from wide- and narrow-band spectrograms and graphic level recordings, and qualitative observations, which consisted simply of setting down impressions derived from visual examination of a number of spectrographic patterns.

Figure 1 shows F0 contours of selected control clusters as uttered by one of the three voices, Voice B, in different situations in the scenario. These contours were obtained by tracing harmonics in narrow-band spectrograms of the utterances. For neutral utterances, the changes in F0 were relatively slow, and the shape of the contour throughout each utterance was smooth and continuous.
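The harmonic-tracing step can be illustrated with a minimal sketch. It assumes only that the k-th harmonic of a voiced sound lies at k times F0; the function name, the choice of the tenth harmonic, and the sample readings are hypothetical and are not taken from the paper.

```python
def f0_from_harmonic(harmonic_hz, harmonic_number):
    """Estimate F0 (Hz) from the frequency of one traced harmonic.

    Assumes the harmonic_number-th harmonic lies at harmonic_number * F0.
    """
    if harmonic_number < 1:
        raise ValueError("harmonic_number must be a positive integer")
    return harmonic_hz / harmonic_number


# Hypothetical readings of the 10th harmonic at three instants (Hz).
tenth_harmonic_hz = [1200.0, 1350.0, 1100.0]
contour_hz = [f0_from_harmonic(h, 10) for h in tenth_harmonic_hz]
```

Tracing a higher harmonic makes F0 movements easier to read, since a reading error of a given size in the k-th harmonic is divided by k in the resulting F0 estimate.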

The contour shapes for utterances produced in anger situations showed an F0 that was generally higher throughout the utterances, suggesting that they were generated with greater emphasis. Furthermore, one or two syllables in each phrase were characterized by large peaks in F0, again indicating strong emphasis on these syllables. Although the excursions in F0 were quite great, there appeared to be a relatively smooth overall contour with one or two major peaks, but with no large discontinuities.

The contours for utterances made in situations involving the emotion sorrow were relatively flat with few fluctuations, and the F0 was usually lower than that for neutral situations. For Voice B (Fig. 1), there was a slowly falling contour during the first half of the utterance, and a more level contour toward the end. Only rarely was emphasis placed on syllables in utterances by any of the three voices in a sorrow situation.

The contours for utterances made in fear situations often departed from the prototype shape for neutral situations. Occasionally there were rapid up-and-down fluctuations within a voiced interval, as in cluster 4 for Voice B. Sometimes sharp discontinuities were noted from one syllable to the next.

From narrow-band spectrograms prepared of the longer speech samples and control clusters, measurements of F0 were obtained every 0.15 sec, and distribution curves were drawn to determine the median F0 and a measure of the range of F0 (10th-90th percentile). Figure 2 shows, for each of the three voices (A, B, C), the median F0 and range of F0 (on a logarithmic scale) for these long speech samples. The lowest F0 was obtained for the emotion sorrow, and the highest for anger. Measurements for the neutral and fear situations were very similar for Voices B and C, but the fear situations showed a wider and higher range of F0, and the distribution for fear was somewhat skewed, with occasional excursions to very high values of F0. The widest range of F0 (on a linear frequency scale) tended to occur for anger. Ranges for sorrow and neutral were very similar, but median values of F0 for the neutral situations were higher than those for sorrow.

FIG. 2. Median F0 and range of F0 for each of the three voices speaking in neutral, sorrow, fear, and anger situations. [Columns: VOICE A, VOICE B, VOICE C; key: S = SORROW, F = FEAR, N = NEUTRAL, A = ANGER.]

FIG. 3. Wide-band spectrograms of the cluster "For God's sake" spoken by Voice B in five situations. [Panels: Neutral (3), Neutral (9), Anger (19), Fear (23), Sorrow (58); horizontal axis: TIME (SECONDS).]

FIG. 4. Narrow-band spectrograms of the cluster "For God's sake" spoken by Voice B in five situations. [Panels: Neutral (3), Neutral (9), Anger (19), Fear (23), Sorrow (58); horizontal axis: TIME (SECONDS).]

Figure 3 shows wide-band spectrograms of one control cluster spoken by Voice B in five situations. For this cluster the total duration was least for the neutral situation, and greatest for sorrow. The lengths of the utterances for fear and for anger were about the same. The increases in duration for the utterances made in anger, fear, and sorrow situations came in part from increases in vowel durations, but primarily from lengthened intervals of closure or vocal-tract constriction for the consonants. Comparison of the initial consonant in the word "God's" in the various spectrograms shows a longer stop gap and a more intense burst at the consonantal release for the various emotional situations than for the neutral situation.
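The median and 10th-90th percentile range of F0 plotted in Fig. 2 can be sketched as follows. This is only an illustration under stated assumptions: the sample values are hypothetical, unvoiced frames are assumed to be coded as 0, and the percentile interpolation rule (the "inclusive" method of Python's `statistics.quantiles`) is a choice made here, since the paper read its values from hand-drawn distribution curves.

```python
import math
import statistics


def f0_summary(f0_hz):
    """Return (median, 10th percentile, 90th percentile) of voiced F0 in Hz."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks an unvoiced frame
    if len(voiced) < 2:
        raise ValueError("need at least two voiced F0 samples")
    deciles = statistics.quantiles(voiced, n=10, method="inclusive")
    return statistics.median(voiced), deciles[0], deciles[-1]


# Hypothetical F0 readings (Hz), one every 0.15 sec.
samples = [110, 120, 0, 130, 125, 140, 0, 150, 135, 115]
median_hz, p10_hz, p90_hz = f0_summary(samples)

# The same range on a linear scale (Hz) and on a logarithmic scale (octaves).
range_hz = p90_hz - p10_hz
range_octaves = math.log2(p90_hz / p10_hz)
```

Computing the range both ways mirrors the distinction drawn in the text: a range that looks widest on a linear frequency scale need not be widest on the logarithmic scale used in Fig. 2.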


The vocal-cord vibrations for the utterance exemplifying sorrow appeared to have considerable fluctuations in shape from one glottal pulse to the next. This voicing irregularity was manifested by a variation in darkness of individual voicing pulses, particularly in the high-frequency region above 2000 Hz. This effect is particularly evident in the word "God's" at the bottom of Fig. 3 (0.8-1.3 sec on the time scale), in the entire frequency region above 2000 Hz. The spectrograms for the anger situation also demonstrate some anomalies that were presumably due to irregularities in the glottal output. For example, the pattern of glottal vibration is not uniform in the words "God's" and "sake," as evidenced by the irregular spectral pattern at high frequencies (above 3 kHz).

The narrow-band spectrograms shown in Fig. 4 for the same utterances again indicate that the highest average frequency and the most rapid frequency changes occurred in the anger situation; there were sharp rises in F0 in the second and third syllables. Rapid rises in F0 also occurred in the neutral situation (cluster location number 3, upper left of Fig. 4), but the rising contour was smoother, and the peak F0 was lower. The contour for the fear situation also showed some rapid fluctuations or tremor in F0; in the middle syllable of utterance number 23, the F0 starts high at the beginning of the syllable, and then falls, with an initial bump or fluctuation. This kind of contour seemed to occur often, but not always, in fear situations.

FIG. 5. Wide-band spectrograms of the cluster "I don't understand it" spoken by Voice A in four situations. [Panel labels legible in this copy: Neutral (2), Anger (17), Sorrow (56); horizontal axis: TIME (SECONDS).]

Measurements of the formant frequencies for particular vowels uttered in the various emotional situations showed some small differences. Of particular interest is the fact that, for anger situations, vowels in the syllables uttered with emphasis had higher first-formant frequencies than the corresponding vowels in neutral situations. Apparently a wider mouth opening was used in the emphatic anger situations, and this yielded a higher first-formant frequency. An example is the first formant of the word "God's" in Fig. 3, which was higher for the anger situation than for the neutral situation.

The effects of the various emotions on the utterances of Voice A were qualitatively similar to those on Voice B, as exemplified by the wide- and narrow-band spectrograms in Figs. 5 and 6, respectively. The neutral utterance (Fig. 5) shows a uniform formant structure and glottal vibration pattern, whereas irregularities in formant amplitudes and amplitudes of successive glottal pulses at high frequencies are evident for sorrow, anger, and fear. In the stressed vowel of the word "understand," the amplitudes of the second and third formants relative to that of the first formant appear to be greater for the anger and fear situations than for the neutral utterance, presumably reflecting a change in the spectrum of the glottal pulses. The consonant closures seem to be better defined in the anger and fear spectrograms than in the neutral one, since the intensity changes at the consonantal closures and releases are more abrupt; the durations of these particular utterances, however, are not significantly different from that of the neutral utterance. The narrow-band spectrogram for the fear situation (Fig. 6) again indicates some tremors and rapid fluctuations in F0.

FIG. 6. Narrow-band spectrograms of the cluster "I don't understand it" spoken by Voice A in four situations. [Panels: Neutral (2), Anger (17), Fear (21), Sorrow (56); horizontal axis: TIME (SECONDS).]

Mean rates of articulation in syllables per second were determined for each of the three actors' utterances of certain longer speech samples selected from points in the scenario where the emotional situation was clearly defined. Table I shows the mean rate of articulation, in syllables per second, for each of the three voices speaking in neutral, anger, fear, and sorrow situations. The ranking of the emotions according to rate of articulation, from fastest to slowest, was the same (with one exception) for each of the three voices: neutral, anger, fear, and sorrow. The rate of articulation obtained in sorrow situations was less than half that found for the other situations. This finding is in agreement with the results of a study by Fairbanks and Hoaglin, a which involved the use of amateur actors to simulate different emotional states; they found marked decreases in rate for grief as compared with anger, sorrow, and indifference.
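The column means and the per-voice ranking can be checked directly from the values printed in Table I. The sketch below is only an arithmetic check; the dictionary layout and function names are illustrative choices, not part of the paper.

```python
# Rates of articulation (syllables per second) as printed in Table I.
RATES = {
    "Voice A": {"neutral": 4.03, "anger": 4.26, "fear": 3.92, "sorrow": 1.84},
    "Voice B": {"neutral": 4.89, "anger": 4.32, "fear": 3.90, "sorrow": 2.03},
    "Voice C": {"neutral": 4.02, "anger": 3.88, "fear": 3.57, "sorrow": 1.86},
}


def mean_rates(rates):
    """Average each situation's rate across voices, rounded to two decimals."""
    situations = next(iter(rates.values())).keys()
    return {
        s: round(sum(voice[s] for voice in rates.values()) / len(rates), 2)
        for s in situations
    }


def ranking(voice_rates):
    """Order the situations from fastest to slowest for one voice."""
    return sorted(voice_rates, key=voice_rates.get, reverse=True)


means = mean_rates(RATES)
rankings = {voice: ranking(r) for voice, r in RATES.items()}
```

With these values the means reproduce the bottom row of Table I, and the single exception to the neutral, anger, fear, sorrow ordering is Voice A, whose anger rate exceeds its neutral rate.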

Measurements of the average spectrum of the speech signal were made for a number of the control clusters in the scenario. The spectrum was obtained by means of an octave-band analyzer, and the outputs of individual filters were averaged over an interval of several seconds.

FIG. 7. Average spectra of selected control clusters as spoken by each of the three voices in different situations in the scenario. These spectra were obtained by taking the long-term average output from octave filters. (Spectra of Voice C in neutral and sorrow situations were identical.) [Panels: VOICE A, VOICE B, VOICE C; curves: NEUTRAL, SORROW, ANGER, FEAR; horizontal axis: OCTAVE BAND CENTER FREQUENCY [Hz], 125 to 4000.]

Figure 7 shows some of the results. The absolute level for each spectrum was normalized by setting the level for the 250-Hz octave band arbitrarily at 0 dB. Octave-band spectra of these kinds demonstrate two kinds of gross acoustic effects. The low-frequency bands (125 and 250 Hz) are in a frequency region occupied by F0 during voiced sounds. If F0 is low (in the range of 90-175 Hz), then it remains in the 125-Hz octave band, and hence there is appreciable average energy in this band. If F0 remains high most of the time (above 175 Hz), then there is less energy in the 125-Hz band. Hence, the relative levels in the 125- and 250-Hz bands provide a very rough indication of the average fundamental frequency.
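The normalization and the band argument above can be sketched as follows. Two assumptions are made here that the paper does not state: the band edges are taken as the nominal octave edges (center frequency divided and multiplied by the square root of 2), and the band levels in the example are hypothetical numbers chosen for illustration.

```python
import math

# Octave-band center frequencies shown on the axis of Fig. 7 (Hz).
CENTERS_HZ = [125, 250, 500, 1000, 2000, 4000]


def band_edges(center_hz):
    """Nominal octave-band edges: center / sqrt(2) and center * sqrt(2)."""
    return center_hz / math.sqrt(2), center_hz * math.sqrt(2)


def band_containing(freq_hz):
    """Return the center frequency of the band holding freq_hz, or None."""
    for center in CENTERS_HZ:
        low, high = band_edges(center)
        if low <= freq_hz < high:
            return center
    return None


def normalize_to_band(levels_db, reference_hz=250):
    """Shift all band levels so that the reference band sits at 0 dB."""
    reference = levels_db[reference_hz]
    return {center: level - reference for center, level in levels_db.items()}


# Hypothetical long-term average band levels (dB) for one utterance.
levels = {125: 62.0, 250: 65.0, 500: 60.0, 1000: 52.0, 2000: 45.0, 4000: 38.0}
normalized = normalize_to_band(levels)
```

Under the nominal-edge assumption the 125-Hz band spans roughly 88 to 177 Hz, which agrees with the 90-175 Hz range quoted in the text for a low F0.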

The spectra in Fig. 7 show relatively less energy in the 125-Hz band for utterances made in anger situations (for which the F0 is high) than for utterances made in neutral situations. For the one voice where data corresponding to the emotion fear were obtained (Voice C), there is also relatively less low-frequency energy for that emotion. Utterances made in sorrow situations do not differ in a consistent way from utterances for neutral situations, at least as far as the low-frequency energy is concerned. Consistent differences among emotions are also observed at the high-frequency end of the spectrum. The level at high frequencies (above 1000 Hz) relative to low frequencies is always greatest for the emotion anger and least for sorrow. The implication is that anger is manifested in a higher subglottal pressure and a narrower glottal pulse, while the reverse is true of sorrow.

TABLE I. Mean rate of articulation (syllables per second) for each of the three voices speaking in different situations.

            Neutral   Anger   Fear   Sorrow
Voice A      4.03     4.26    3.92    1.84
Voice B      4.89     4.32    3.90    2.03
Voice C      4.02     3.88    3.57    1.86
Mean         4.31     4.15    3.80    1.91

FIG. 8. Radio announcer speaking before (top) and after (middle and bottom) the crash of the HINDENBURG. Narrow-band. (From Williams and Stevens, Ref. 1.) [Utterances: "Well here it comes, ladies and gentlemen"; "a terrific crash, ladies and gentlemen"; "I, I can't talk, ladies and gentlemen"; horizontal axis: TIME (SECONDS).]

The changes in average spectrum for the emotion fear, together with the incidental observation from spectrograms of a decrease in low-frequency energy (in a range up to, say, 500 Hz) relative to energy at higher frequencies, are consistent with some measurements of average spectra of a Soviet cosmonaut under various critical situations during a space flight. The situations where the emotion fear was expected to be greatest in space flight also led to an upward shift in the centroid of the spectrum within the frequency range 300-1200 Hz.
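One common way to express a band-limited spectral centroid is the power-weighted mean frequency inside the band. The sketch below uses that definition as an assumption; the cited cosmonaut measurements may have defined the centroid differently, and the frequencies and power values are hypothetical.

```python
def band_centroid(freqs_hz, power, low_hz=300.0, high_hz=1200.0):
    """Power-weighted mean frequency (Hz) of components inside [low_hz, high_hz]."""
    weighted_sum = 0.0
    total_power = 0.0
    for freq, p in zip(freqs_hz, power):
        if low_hz <= freq <= high_hz:
            weighted_sum += freq * p
            total_power += p
    if total_power == 0:
        raise ValueError("no spectral power inside the band")
    return weighted_sum / total_power


# Hypothetical spectra: components outside 300-1200 Hz are ignored.
freqs = [200.0, 400.0, 800.0, 1600.0]
calm = band_centroid(freqs, [5.0, 1.0, 1.0, 5.0])
stressed = band_centroid(freqs, [5.0, 1.0, 3.0, 5.0])
```

In this toy example, moving power from the 400-Hz component toward the 800-Hz component raises the centroid, which is the kind of upward shift described for the fear situations.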

IV. COMPARATIVE ANALYSIS OF ACTUAL AND SIMULATED DESCRIPTIONS OF HINDENBURG DISASTER

In order to provide some justification for the use of actors in studies of the acoustic correlates of emotion, some recordings were obtained of utterances spoken in real-life situations in which the circumstances gave a clear indication of the emotions being experienced by the speakers. The object was to compare the acoustic characteristics of the voices in these situations with the characteristics in the voices of the actors in similar situations. One recording that was acquired was that of the radio announcer who was describing the approach of the HINDENBURG at Lakehurst, New Jersey, when the Zeppelin suddenly burst into flames. The announcer continued his description (with one or two short breaks) throughout the disaster.

Figure 8 shows three narrow-band spectrograms that were made from excerpts of the announcer's voice before and after the crash occurred. The announcer's normal voice had a great deal of inflection, as indicated by the smooth up-and-down movements of F0 in the upper spectrogram. The shape of the contour changed abruptly immediately after the crash, as the spectrograms in the middle and lower portion of the figure demonstrate. The average F0 in those portions is considerably higher, and there is apparently much less fluctuation in frequency. Quantitative analysis of longer samples of the announcer's speech has indicated, however, that there was, in fact, a greater fundamental-frequency range after the disaster than before. There are some irregular bumps in the contour, which might be interpreted as a kind of tremor; examples of these irregularities can be seen in the bottom spectrogram of Fig. 8, near 0.6 sec and again near 0.8 sec. The irregularities may reflect a loss of precise control of musculature and an irregular respiratory pattern.

FIG. 9. Voice C simulating announcement of HINDENBURG crash before (top) and after (middle and bottom) the disaster. Narrow-band. [Utterances: "here it comes, ladies and gentlemen"; "terrific crash, ladies and gentlemen"; "I can't talk, ladies and gentlemen"; horizontal axis: TIME (SECONDS).]

The actor designated as Voice C was provided with a transcription of the radio announcer's words (but not the actual recording) and was asked to study and then to read the script as if he were the radio announcer describing the event. He reported that he had never heard the recorded radio account of the disaster. A recording was made of Voice C's "simulated" description of the events before and after the crash. Spectrograms taken from this simulation are shown in Fig. 9. The excerpts are the same as those examined for the radio announcer, shown in Fig. 8.

Voice C on the before-the-crash recording had less inflection than the announcer, but the smooth fluctuations in his F0 followed the expected form. Following the crash, there was an increase in fundamental frequency, and there were erratic changes in F0 throughout a phrase. Voice C showed greater up-and-down fluctuations than did the radio announcer, but the irregular bumps and atypical contours are reminiscent of the spectrograms for the emotion fear in the scenario (Figs. 1, 4, 6). The similarities, as well as the differences, between the radio announcer's voice during the crash and Voice C's simulation of the same announcement can be seen in Fig. 10. In both narrow-band spectrograms, there are instants in time at which rapid jumps or "tremors" in F0 occur.

[Figure 10: two narrow-band spectrograms. Panel labels recovered from the figure: "It's crashing, it's crashing terrible" and "It's crashing terrible". Horizontal axis: TIME (SECONDS).]

FIG. 10. Radio announcer (top) speaking during the crash of the HINDENBURG and Voice C (bottom) simulating the same portion of the announcement. Narrow-band.

These comparative data and some additional limited data obtained from real-life emotional situations (Ref. 2) are not inconsistent with the data obtained from the actors in this study.

Quantitative data on the median F0 and the range of F0 for the radio announcer and for Voice C simulating the description are given in Table II. The data were derived from samples of speech of several seconds' duration before the crash and after the crash. The increased median F0 and the greater range of F0 for the emotional situation are quite apparent both for the announcer and for Voice C. The increase in both the median F0 and in the range of F0 was greater for Voice C than for the announcer.

TABLE II. Median fundamental frequency (F0) and range of fundamental frequency for selected utterances obtained from the original radio description of the HINDENBURG disaster and from Voice C's simulated announcement of the same event. (From Williams and Stevens, Ref. 1.)

                                   Median F0 (Hz)    F0 range (Hz)
Original Radio Description
    Before crash                   166               124-196
    After crash                    196               152-260
Voice C's Simulated Description
    Before crash                   138               117-168
    After crash                    222               117-280
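The comparison drawn from Table II can be checked with a few lines of arithmetic. The sketch below uses the Table II values (in Hz) and computes, for each speaker, the rise in median F0 and the growth in the width of the F0 range. The dictionary layout and the helper name `f0_changes` are illustrative choices, not part of the original study.

```python
# Values transcribed from Table II, in Hz: (median F0, lowest F0, highest F0).
TABLE_II = {
    "announcer": {"before": (166, 124, 196), "after": (196, 152, 260)},
    "voice_c": {"before": (138, 117, 168), "after": (222, 117, 280)},
}


def f0_changes(speaker):
    """Return (rise in median F0, growth in F0 range width), both in Hz."""
    med_b, lo_b, hi_b = TABLE_II[speaker]["before"]
    med_a, lo_a, hi_a = TABLE_II[speaker]["after"]
    return med_a - med_b, (hi_a - lo_a) - (hi_b - lo_b)


for name in TABLE_II:
    d_median, d_width = f0_changes(name)
    print(f"{name}: median +{d_median} Hz, range width +{d_width} Hz")
```

Both differences come out larger for Voice C than for the announcer, in line with the statement that the actor's changes exceeded those in the real recording.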

V. SUMMARY OF GROSS ACOUSTIC ATTRIBUTES ASSOCIATED WITH VARIOUS EMOTIONS

A. Anger

The most consistent and striking acoustic manifestation of the emotion anger was a high F0 that persisted throughout a breath group. This increase was, on the average, at least half an octave above the F0 for a neutral situation. The range of F0 observed for utterances spoken in anger situations was also considerably greater than the range for the neutral situations. Some syllables were produced with increased intensity or emphasis, and the vowels in these syllables had the highest fundamental frequency. These syllables also tended to have weak first formants, and were often generated with some voicing irregularity (i.e., irregular fluctuations from one glottal pulse to the next). The basic opening and closing articulatory gestures characteristic of the vowel-consonant alternation in speech appeared to be more extreme when a speaker was angry; the vowels tended to be produced with a more open vocal tract (and hence to have higher first-formant frequencies), and the consonants were generated with a more clearly defined closure. The durations of utterances spoken in anger were usually longer (and the syllabic rate lower), but this effect was not great and was not always consistent for all voices. Although the general manifestations of anger were similar for the three voices, there were some individual differences. The increase in fundamental frequency was greater for some voices than for others, and there were differences in the way duration and other characteristics changed.
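To make the "half an octave" figure concrete: an octave is a doubling of frequency, so a half-octave rise multiplies F0 by the square root of 2 (about 1.41). The sketch below converts between frequency ratios and octaves; the neutral F0 of 120 Hz is a hypothetical value chosen only for illustration, and the function names are not from the original study.

```python
import math


def octaves_above(f0_hz, neutral_f0_hz):
    """Interval between two fundamental frequencies, in octaves."""
    return math.log2(f0_hz / neutral_f0_hz)


def half_octave_up(neutral_f0_hz):
    """F0 lying half an octave above the given neutral F0."""
    return neutral_f0_hz * 2 ** 0.5


neutral = 120.0  # hypothetical neutral F0 in Hz, not a value from the study
print(f"half an octave above {neutral:.0f} Hz is {half_octave_up(neutral):.1f} Hz")
```

For a speaker whose neutral F0 is 120 Hz, the anger F0 described above would therefore lie at roughly 170 Hz or higher.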

B. Fear

The average F0 for fear was lower than that observed for anger, and for some voices it was close to that for utterances spoken in neutral situations. There were, however, occasional peaks in the F0 that were much higher than those encountered in a neutral situation. These peaks were interspersed with regions where the fundamental frequency was in a normal range. The pitch contours in the vicinity of the peaks sometimes had unusual shapes (irregular bumps or discontinuities), and voicing irregularity was sometimes present. The duration of an utterance tended to be longer than in the case of anger or neutral situations. As was observed for anger, the vowels and consonants produced in a fear situation were often more precisely articulated than they were in a neutral situation. Although these various characteristics were found for some utterances of some voices, observations of spectrograms revealed no clear and consistent correlate for the emotion fear.

C. Sorrow

The average fundamental frequency observed for the actors speaking in sorrow situations was considerably lower than that for neutral situations, and the range of F0 was usually quite narrow. This change in F0 was accompanied by a marked decrease in rate of articulation (by a factor of two or more in syllabic rate, on the average) and an increase in the duration of an utterance. The increased duration resulted from longer vowels and consonants and from pauses that were often inserted in a sentence. Perhaps the most striking effect on the wide-band spectrogram was voicing irregularity. On occasion the voicing irregularity reduced simply to noise; i.e., the voiced sounds became whispered, in effect.

D. Neutral

In neutral situations the spectrograms of the actors generally showed a well-defined structure during the vowels, with little noise or irregularities either between the formants or in the high-frequency regions where formants are often not visible. Consonants were frequently uttered in an imprecise manner, particularly when they appeared in unstressed syllables. Sentences were usually generated with shorter durations than for the emotional situations.

VI. CONCLUSIONS

The aspect of the speech signal that appears to provide the clearest indication of the emotional state of a talker is the contour of F0 vs time. This contour has a prototype shape for a breath group that is generated in a normal manner, without marked emotions of any kind. The normal contour is characterized by smooth, slow, and continuous changes in F0 as a function of time, the changes occurring in syllables on which emphasis or linguistic stress is to be placed. Emotions appear to have several effects on this basic contour shape.

While at present it is certainly not possible to specify any quantitative automatic procedures that reliably indicate the emotional state of a talker, measurements of the median F0 and range of F0 for a sample of speech of several seconds' duration may at least serve to classify a talker's emotional state as one of sorrow (reduced F0 and decreased range), or of anger or fear (increased F0 and range), assuming that the normal F0 and range of F0 for the talker are known. Further identification of the emotions must be done by an experienced observer who must look for certain attributes of the F0 contour, shifts in spectrum, changes in duration, and voicing irregularities.
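The coarse two-way classification described above can be written down as a simple rule. The sketch below is only an illustration of that rule and not a procedure given by the authors: the function name, the use of range width as the measure of "range", and the "undetermined" fallback are all assumptions, and no tolerance margin is applied for ordinary speaker variability.

```python
def coarse_emotion_class(median_f0, f0_range, normal_median_f0, normal_f0_range):
    """Coarse class from median F0 and F0 range relative to the talker's norm.

    f0_range and normal_f0_range are (low, high) tuples in Hz.
    Assumption: "range" is compared via its width, high minus low.
    """
    width = f0_range[1] - f0_range[0]
    normal_width = normal_f0_range[1] - normal_f0_range[0]
    if median_f0 < normal_median_f0 and width < normal_width:
        return "sorrow"
    if median_f0 > normal_median_f0 and width > normal_width:
        return "anger or fear"
    # Mixed or unchanged indications: left to an experienced observer.
    return "undetermined"


# Radio announcer from Table II, treating the before-crash values as his norm.
label = coarse_emotion_class(196, (152, 260), 166, (124, 196))
print(label)
```

Applied to the announcer's Table II values, with the before-crash sample standing in for his normal speech, the rule returns the "anger or fear" class; telling those two apart still requires the contour, spectrum, and duration cues listed above.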

ACKNOWLEDGMENTS

The authors acknowledge with gratitude the support of the Bureau of Medicine and Surgery, Department of the Navy, and the valuable contributions made by the following individuals during the early stages of this work: Michael H. L. Hecker, Dr. Stephen G. Langley, Barbara Woods, and Sarah Heintz.

* From the Naval Aerospace Med. Res. Lab., Pensacola, Fla. Opinions or conclusions contained in this report are those of the authors and do not necessarily reflect the views or endorsement of the Navy Department.

† This article is based on portions of a monograph cited in Ref. 2 below. Refer to the monograph for a more detailed description of this study, a review of the literature, a description and results of some listener tests, and a bibliography.

1 C. E. Williams and K. N. Stevens, "On Determining the Emotional State of Pilots During Flight: An Exploratory Study," Aerospace Med. 40, 1369-1372 (1969).

2 C. E. Williams and K. N. Stevens, "The Emotional Content of Speech: Some Exploratory Acoustical Studies," NAMRL Monogr. 18, Naval Aerospace Med. Res. Lab., Pensacola, Fla. (1972).

3 C. E. Williams, K. N. Stevens, and M. H. L. Hecker, "Acoustical Manifestations of Emotional Speech," J. Acoust. Soc. Amer. 47, 66(A) (1970).

4 G. Fairbanks and W. Pronovost, "An Experimental Study of the Pitch Characteristics of the Voice During the Expression of Emotions," Speech Monogr. 6, 87-194 (1939).

5 G. Fairbanks and L. W. Hoaglin, "An Experimental Study of the Durational Characteristics of the Voice During the Expression of Emotion," Speech Monogr. 8, 85-90 (1941).

6 J. R. Davitz, The Communication of Emotional Meaning (McGraw-Hill, New York, 1964), Chap. 8, pp. 101-112.

7 V. A. Popov, P. V. Simonov, A. G. Tishchenko, M. V. Frolov, and L. S. Khachatur'yants, "Analysis of Intonational Characteristics of Speech as a Criterion of the Emotional State of Man Under Conditions of Space Flight," Zh. Vysshey Nervnoy Deyatel'nosti [J. Higher Nervous Activity] 16, 974-983 (1966).

8 D. B. Lindsley, "Emotion," in Handbook of Experimental Psychology, S. S. Stevens, Ed. (Wiley, New York, 1951), p. 477.

9 V. A. Popov, P. V. Simonov, M. V. Frolov, and L. S. Khachatur'yants, "Frequency Spectrum of Speech as an Indicator of the Degree and Nature of Emotional Stress," Zh. Vysshey Nervnoy Deyatel'nosti 1, 104-109 (1971).