Analysis and Synthesis of Shouted Speech

Post on 01-Jan-2016

DESCRIPTION

Analysis and Synthesis of Shouted Speech. Tuomo Raitio, Jouni Pohjalainen, Manu Airaksinen, Paavo Alku, Antti Suni, Martti Vainio. Shout is the loudest mode of vocal communication. It is used for increasing the signal-to-noise ratio (SNR) when communicating over an interfering noise.

TRANSCRIPT

Page 1: Analysis and Synthesis of Shouted Speech

Analysis and Synthesis of Shouted Speech

Page 2: Analysis and Synthesis of Shouted Speech

Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based Vocoding
Analysis and Synthesis of Shouted Speech
Raitio, Suni, Pohjalainen, Airaksinen, Vainio, Alku

Shout

• Shout is the loudest mode of vocal communication
• It is used for increasing the signal-to-noise ratio (SNR) when communicating
  • over an interfering noise
  • over a distance
• Shouting is also used for expressing emotions or intentions

Page 3: Analysis and Synthesis of Shouted Speech

Properties of shout

• Shout is produced by raising the subglottal pressure and increasing the vocal fold tension
• In effect, shout is characterized by
  • Increased sound pressure level (SPL)
  • Increased fundamental frequency (f0)
  • Increased amplitudes in mid-frequencies (1–4 kHz)
  • Increased duration and energy of vowels
  • Decreased duration and energy of consonants
  • Less accurate articulation

Page 4: Analysis and Synthesis of Shouted Speech

Why perform shout synthesis?

• Fortunately, shouting is used rarely, but it is an essential part of human vocal communication
• Shout synthesis may be needed, e.g., for creating speech with emotional content; it can be used in human-computer interaction and in creating virtual worlds and characters

Page 5: Analysis and Synthesis of Shouted Speech

In this study…

• Normal and shouted speech were recorded
• Properties of normal and shouted speech were analyzed
• Methods for producing natural-sounding HMM-based synthetic shout were investigated

Page 6: Analysis and Synthesis of Shouted Speech

Recording of normal and shouted speech

• Normal and shouted speech were recorded in an anechoic chamber
  • 22 Finnish speakers
  • 24 sentences each of normal speech and shout from each speaker
  • A total of 1056 sentences (22 speakers × 24 sentences × 2 styles)
  • Subjects were asked to use a very loud voice when shouting
• In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes

Page 7: Analysis and Synthesis of Shouted Speech

Page 8: Analysis and Synthesis of Shouted Speech

Acoustic analysis of shout

• The following acoustic properties were analyzed from the recorded shouted and normal speech:
  • sound pressure level (SPL)
  • duration
  • fundamental frequency (f0)
  • spectrum
  • properties of the voice source:
    • shape of the glottal pulse
    • H1-H2 parameter
    • NAQ (normalized amplitude quotient) parameter
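The slides name the H1-H2 and NAQ parameters without defining them. As a minimal sketch, assuming the commonly used definitions (H1-H2 as the level difference in dB between the first and second harmonic of the voice source spectrum, and NAQ as the peak-to-peak flow amplitude divided by the product of the negative peak of the flow derivative and the period), they could be computed roughly as follows. The function names, the ±20% harmonic search band, and the assumption that an estimated glottal flow signal is already available are illustrative choices, not details from the study.

```python
import numpy as np

def h1_h2_db(source, fs, f0):
    """Level difference (dB) between the 1st and 2nd harmonic of a voice source signal."""
    windowed = source * np.hanning(len(source))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(source), d=1.0 / fs)

    def harmonic_peak(k):
        # Largest spectral magnitude within +/-20% of f0 around the k-th harmonic
        # (the band width is an assumption made for this sketch).
        band = (freqs > (k - 0.2) * f0) & (freqs < (k + 0.2) * f0)
        return spectrum[band].max()

    return 20.0 * np.log10(harmonic_peak(1) / harmonic_peak(2))

def naq(flow_cycle, fs):
    """Normalized amplitude quotient of one glottal flow cycle: f_ac / (d_peak * T)."""
    f_ac = flow_cycle.max() - flow_cycle.min()   # peak-to-peak flow amplitude
    d_peak = -np.diff(flow_cycle).min() * fs     # magnitude of the steepest flow decrease
    period = len(flow_cycle) / fs                # cycle length T in seconds
    return f_ac / (d_peak * period)
```

Both functions expect the glottal source (for example, the output of glottal inverse filtering), not the radiated speech signal.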

Page 9: Analysis and Synthesis of Shouted Speech

Acoustic analysis of shout – Results

• On average (speech → shout)
  • SPL increased by 21 dB for females and 22 dB for males
  • Sentence duration increased by 20% for females and 24% for males
  • f0 increased by 71% for females and 152% for males
  • Spectrum was emphasized in the 1–4 kHz range
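To put these figures on more familiar scales, the short sketch below converts an SPL difference in dB to a sound pressure ratio and a relative f0 increase to semitones. The helper names are illustrative and not from the study; the conversions themselves are the standard ones (20·log10 for pressure, 12·log2 for semitones).

```python
import math

def db_to_pressure_ratio(db):
    """Sound pressure ratio corresponding to an SPL difference in dB."""
    return 10.0 ** (db / 20.0)

def percent_increase_to_semitones(percent):
    """Musical interval in semitones corresponding to a relative f0 increase."""
    return 12.0 * math.log2(1.0 + percent / 100.0)

for label, spl_db, f0_pct in [("female", 21, 71), ("male", 22, 152)]:
    ratio = db_to_pressure_ratio(spl_db)
    semitones = percent_increase_to_semitones(f0_pct)
    print(f"{label}: pressure x{ratio:.1f}, f0 +{semitones:.1f} semitones")
```

By this reckoning, the reported SPL increase corresponds to a roughly 11- to 13-fold increase in sound pressure, and the f0 increase to about 9 semitones for females and about 16 semitones (more than an octave) for males.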

Page 10: Analysis and Synthesis of Shouted Speech

Page 11: Analysis and Synthesis of Shouted Speech

[Figure: panels Overall, Voiced, and Unvoiced; columns Female and Male]

Page 12: Analysis and Synthesis of Shouted Speech

Page 13: Analysis and Synthesis of Shouted Speech

Problems…

• Differences between normal speech and shout are large
• This causes problems in many speech processing algorithms:
  • Due to the high f0, accurate estimation of the speech spectrum is difficult
  • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source
  • Linear prediction (LP) is especially prone to this type of bias

Page 14: Analysis and Synthesis of Shouted Speech

Spectrum estimation of shout

• The biasing effect of the harmonics must be reduced
  • For this purpose, e.g., weighted linear prediction (WLP) can be used
• In WLP, the effect of the excitation on the spectrum is reduced
  • This is done by weighting the squared residual with a specific function
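The idea of weighting the squared residual can be sketched as a weighted least-squares problem. The code below is an illustration under stated assumptions, not the implementation used in the study: it uses a covariance-style formulation (no frame windowing), the function names `weighted_lp` and `ste_weight` are hypothetical, and the short-time energy weight is taken to be the energy of the samples immediately preceding each time instant.

```python
import numpy as np

def weighted_lp(frame, order, weights=None):
    """Predictor coefficients a_1..a_p minimizing sum_n w[n] * e[n]**2,
    where e[n] = s[n] - sum_k a_k * s[n-k]. With weights=None this is plain LP."""
    s = np.asarray(frame, dtype=float)
    n = np.arange(order, len(s))
    # Each row holds the past samples s[n-1], ..., s[n-p] for one time index n.
    past = np.stack([s[n - k] for k in range(1, order + 1)], axis=1)
    target = s[n]
    w = np.ones(len(n)) if weights is None else np.asarray(weights, dtype=float)[n]
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    sqrt_w = np.sqrt(w)
    coeffs, *_ = np.linalg.lstsq(past * sqrt_w[:, None], target * sqrt_w, rcond=None)
    return coeffs

def ste_weight(frame, window):
    """Short-time energy of the `window` samples preceding each sample."""
    s2 = np.asarray(frame, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(s2)))
    idx = np.arange(len(s2))
    start = np.maximum(idx - window, 0)
    # csum[idx] - csum[start] equals sum(s2[start:idx]); the small floor avoids zero weights.
    return csum[idx] - csum[start] + 1e-9
```

Calling `weighted_lp(frame, order)` gives a conventional LP fit, while `weighted_lp(frame, order, ste_weight(frame, window))` emphasizes samples that follow high-energy regions of the frame, which is the intuition behind reducing the influence of the excitation on the estimated spectrum.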

Page 15: Analysis and Synthesis of Shouted Speech

LP vs. weighted linear prediction (WLP)

Conventional LP:

Weighted LP:
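The two formulas on this slide did not survive in the transcript. As a hedged reconstruction based on the preceding slide (WLP weights the squared residual), the standard textbook error criteria are given below. The symbols $s_n$ (speech sample), $a_k$ (predictor coefficients), $p$ (prediction order), and $w_n$ (weight function) are notation assumed here, not taken from the slides.

```latex
% Conventional LP: minimize the energy of the prediction residual
E_{\mathrm{LP}} = \sum_{n} \Bigl( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Bigr)^{2}

% Weighted LP: the same squared residual, weighted sample by sample with w_n
E_{\mathrm{WLP}} = \sum_{n} w_n \Bigl( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Bigr)^{2}
```

Setting $w_n = 1$ for all $n$ reduces the weighted criterion to conventional LP.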

Page 16: Analysis and Synthesis of Shouted Speech

Page 17: Analysis and Synthesis of Shouted Speech

Spectrum estimation of shout

• The following spectrum estimation methods were compared for normal speech and shout:
  1. Conventional linear prediction (LP)
  2. WLP with STE weight (STE-WLP)
  3. WLP with AME weight (AME-WLP)

STE – short-time energy
AME – attenuation of the main excitation
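The slides do not spell out how the AME weight is built. As a rough sketch of the idea its name suggests, the function below returns a weight vector that is 1 everywhere except in a short region around each main excitation instant (assumed here to be known glottal closure instants), where it drops to a small floor value. The function name, the parameters `width`, `lead`, and `floor`, and the rectangular shape of the attenuation are all assumptions made for illustration.

```python
import numpy as np

def ame_weight(frame_len, gci_positions, width, lead=0, floor=0.01):
    """Weight function that downweights samples around each excitation instant.

    gci_positions: sample indices of the main excitations (assumed to be given)
    width:         number of attenuated samples per excitation (hypothetical parameter)
    lead:          how many samples before the excitation the attenuation starts
    floor:         weight value inside the attenuated region
    """
    w = np.ones(frame_len)
    for gci in gci_positions:
        start = max(gci - lead, 0)
        stop = min(start + width, frame_len)
        w[start:stop] = floor
    return w

# Example: a 20-sample frame with excitations at samples 5 and 15.
print(ame_weight(20, [5, 15], width=3, lead=1))
```

The resulting vector can serve as the per-sample weight in a weighted least-squares LP fit, so that samples dominated by the glottal excitation contribute little to the spectral estimate.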

Page 18: Analysis and Synthesis of Shouted Speech

LP vs. WLP in resynthesis

• Subjective listening tests indicate that
  • WLP-AME performs best with normal speech
  • WLP-STE performs best with shout

[Figure labels: LP, WLP-STE, WLP-AME]

Page 19: Analysis and Synthesis of Shouted Speech

LP vs. WLP in HMM-based speech synthesis

• Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation)

[Figure labels: Female, Male]

Page 20: Analysis and Synthesis of Shouted Speech

Synthesis of shout (1)

• HMM-based synthesis is a very flexible means of producing different speaking styles, such as shout

[Diagram labels: Speech data, Statistical model, Text, Synthetic speech; stages: Training, Synthesis]

Page 21: Analysis and Synthesis of Shouted Speech

Synthesis of shout (2)

• It is difficult to obtain enough shout data to construct a TTS voice

[Diagram label: Shout data]

Page 22: Analysis and Synthesis of Shouted Speech

Synthesis of shout (3)

• Statistical adaptation of the normal speech model was used to generate synthetic shouted speech

[Diagram labels: Speech data, Statistical model, Shout data, Adaptation, Text, Synthetic shout; stages: Training, Synthesis]

Page 23: Analysis and Synthesis of Shouted Speech

Synthesis of shout (4)

• Alternatively, the synthetic speech can be converted into shouted speech using a simple voice conversion technique

[Diagram labels: Speech data, Statistical model, Shout data, Voice conversion, Text, Synthetic shout; stages: Training, Synthesis]

Page 24: Analysis and Synthesis of Shouted Speech

Evaluation (1)

• The following speech types were selected for the test:
  1. Natural normal speech
  2. Natural shout
  3. Synthetic normal speech
  4. Synthetic shout (adapted)
  5. Synthetic shout (voice conversion)

Page 25: Analysis and Synthesis of Shouted Speech

Evaluation (2)

• MOS-style listening test in which the following questions were rated:
  1. How would you rate the quality of the speech sample?
  2. How much does the sample resemble shouting?
  3. How much effort did the speaker use for producing the speech?
• Scale from 1 to 5 with verbal anchors
• Loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL
• 11 test subjects evaluated 50 samples each
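The slides do not say how loudness was normalized or how the ratings were summarized. The sketch below shows one plausible setup under stated assumptions: RMS-based level normalization (a simple stand-in for whatever loudness measure was actually used) and a mean opinion score with a normal-approximation 95% confidence interval. The function names and the target RMS value are hypothetical.

```python
import numpy as np

def rms_normalize(signal, target_rms=0.1):
    """Scale a signal so that its RMS level equals target_rms (assumed normalization)."""
    x = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return x if rms == 0 else x * (target_rms / rms)

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of its approximate 95% confidence interval."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half_width = z * r.std(ddof=1) / np.sqrt(len(r))
    return mean, half_width

# Example with made-up ratings on the 1 to 5 scale.
mean, half_width = mos_with_ci([3, 4, 4, 5, 2, 4])
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

Normalizing the level before the test removes SPL as a cue, so that listeners have to judge shouting and vocal effort from spectral and prosodic properties instead.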

Page 26: Analysis and Synthesis of Shouted Speech

Results – Naturalness

• Shout synthesis is rated lower in quality than normal speech synthesis (as expected)

[Figure labels: Normal synthesis, Shout synthesis]

Page 27: Analysis and Synthesis of Shouted Speech

Results – Impression of shouting

• The impression of shouting is, however, fairly well preserved

[Figure labels: Natural shout, Synthetic shout]

Page 28: Analysis and Synthesis of Shouted Speech

Results – Vocal effort

• Adaptation produces a better impression of the vocal effort used than the voice conversion method

[Figure labels: Adapted shout, Voice conversion shout]

Page 29: Analysis and Synthesis of Shouted Speech

Summary (1)

• Synthesis of shout is challenging for many reasons:
  1. It is difficult to obtain large amounts of shout data with consistent quality
  2. Differences between normal speech and shout are large, which causes problems in many speech processing algorithms
• In this work, the biasing effect of high-pitched shout was reduced by using weighted linear prediction (WLP) methods
• Subjective listening tests show that WLP models work better with shout than conventional LP

Page 30: Analysis and Synthesis of Shouted Speech

Summary (2)

• In this study, synthetic shout was produced with two different techniques:
  1. Adaptation
  2. Voice conversion of the synthetic normal speech
• The methods were rated equal in quality
• Impression of shouting and the use of vocal effort were better preserved in the adapted shout

Page 31: Analysis and Synthesis of Shouted Speech

Thank you!

Samples: Male, Female