the challenge of analyzing conversational speech

35
Advanced Signal Processing II Johannes Luig The Challenge of Analyzing Conversational Speech

Upload: others

Post on 21-Dec-2021

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II

Johannes Luig

The Challenge of Analyzing

Conversational Speech

Page 2: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Contents

‣ Introduction

‣ The Four Main Challenges

- Hidden Punctuation

- Disfluencies

- Turn-Taking

- Emotions

‣ Conclusion / Outlook

2

Page 3: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Introduction

3

Page 4: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Written vs. Spoken Language

- Written Language- explicit punctuation (word and sentence boundaries)

- Spoken Language- stream of words- no obvious lexical marking- phrasing expressed through prosody

4

Page 5: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Importance of word and sentence boundaries

- Automatic downstream processing- Parsing- Information extraction- Dialog act modeling- Summarization- Translation

- Human readability

- (Performance of speech recognition)

5

Page 6: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Importance of word and sentence boundaries

- Language Model creation (the „common“ way)- Recording of spoken text- Acoustical segmentation into sentence-like units

- Chopping at longer pauses and speaker changes

- Problems introduced by conversational speech- Pause ≠ indicator of sentence boundary:

- Sentences strung together without pauses- Pauses occur at locations which are no sentence boundaries

6

Page 7: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Particular Task: Endpointing

- Determination when an utterance is „complete“- Application in online human-computer dialog systems- Online (real-time) procedure

- Common method:- Waiting for a pause that is longer than pre-defined

threshold

7

Page 8: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

from [5]

Page 9: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

Pause = silence > 30ms

from [5]

Page 10: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

from [5]

Page 11: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

from [5]

Page 12: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

from [5]

Page 13: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Hidden Punctuation

‣ Example: Prosody-based Approach

8

DP = {DP1, DP2, ..., DPN} ... set of decision points

from [5]

Page 14: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Types of disfluencies

- Repetitions- Speaker repeats some part of the utterance:

I ... I like it.

- Revisions- Speaker modifies some part of the utterance:

We ... I like it.

- Restarts- Speaker abandons an utterance and then starts over:

It’s also ... I like it.

9

Page 15: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Types of disfluencies

- Filled pauses- Editing/correction is „commented“:

We ... I mean, I like it.

10

Page 16: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

‣ Structure of disfluencies

We ... I mean, I like it.

Disfluencies

11

Page 17: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

‣ Structure of disfluencies

We ... I mean, I like it.

Disfluencies

11

Reparandum Editing Phase Resumption

Page 18: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

‣ Structure of disfluencies

We ... I mean, I like it.

Disfluencies

11

Reparandum Editing Phase Resumption

Speech Recognition

Page 19: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Occurrence of disfluencies

- Frequency ∼ every 20 words

- up to 1/3 of all utterances affected

‣ Coping with disfluencies- Modeling rather than viewing them as „errors“

- Detection of interruption points:We ... I like it.

- For filled pauses, also the extent of the „cutaway“ region has to be determined

12

Page 20: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Occurrence of disfluencies

- Frequency ∼ every 20 words

- up to 1/3 of all utterances affected

‣ Coping with disfluencies- Modeling rather than viewing them as „errors“

- Detection of interruption points:We ... I like it.

- For filled pauses, also the extent of the „cutaway“ region has to be determined

12

Page 21: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Occurrence of disfluencies

- Frequency ∼ every 20 words

- up to 1/3 of all utterances affected

‣ Coping with disfluencies- Modeling rather than viewing them as „errors“

- Detection of interruption points:We ... I like it.

- For filled pauses, also the extent of the „cutaway“ region has to be determined

12

Page 22: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Example: Prosody- and LM-based Approach

- We ... I like it.

13

from [2]

Page 23: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Example: Prosody- and LM-based Approach

- We ... I like it.

13

from [2]

Page 24: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Example: Prosody- and LM-based Approach

- We ... I like it.

13

from [2]

Page 25: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Disfluencies

‣ Example: Prosody- and LM-based Approach

- We ... I like it.

13

from [2]

Language Models: „Hidden-Event“ LM

- word-based LM- part-of-speech based LM- repetition pattern LM

Page 26: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Turn-Taking

‣ Speaker alternation in conversations

- Speakers do not alternate sequentially

- Listeners anticipate the end of a speaker‘s contribution by analyzing

- Syntax- Semantics- Pragmatics- Prosody

- and start talking before current speaker is finished

14

Page 27: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Turn-Taking

‣ Example: Spectrogram Decomposition

- Assumption:- STFT vectors are outcomes of a discrete random process

generating frequency bin indices

- Modeling:- Random process: mixture multinomial distribution- Speaker distribution learned from training data

(supervised approach)

- separation → maximum likelihood estimates

15

Page 28: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Turn-Taking

‣ Example: Spectrogram Decomposition

- Assumption:- STFT vectors are outcomes of a discrete random process

generating frequency bin indices

- Modeling:- Random process: mixture multinomial distribution- Speaker distribution learned from training data

(supervised approach)

- separation → maximum likelihood estimates

15

from [6]

Page 29: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Emotions

‣ Hearing „more than words“

- Inherently difficult or a machine, since

- Rhetorical devices like irony require interpretation and contextual relation

- „Natural“ emotion data is very rare- Labeling of gathered data is difficult even for humans

- Intractable Problems:- Finding the classification reference („neutral“ state)- Setting the „emotion analysis time window“

16

Page 30: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Emotions

‣ Example: Features used for Emotion Analysis

- „Classic“ Features- Time domain: duration / speaking rate / pause- Frequency domain: pitch / energy / spectral tilt

- Features introduced by recent Approaches:- Loudness (RMS) across critical bands → Bark scale- Glottal-excitation-derived features- Voice quality features (Jitter, ...)- Harmonicity features

17

Page 31: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Conclusion / Outlook

‣ Four Main Challenges:

- Hidden Punctuation

- Disfluencies

- Turn-Taking

- Emotions

‣ Approaches mainly incorporate:

- Prosody

- Language Models

18

Page 32: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

Conclusion / Outlook

‣ Perspective

- Modeling and synthesis of conversational speech remains an interesting and challenging task

‣ Improvement Opportunities

- Improved basic features

- Incorporation of „longer-range“ information (greater time windows)

- Speaker-dependent modeling not only in frame-level acoustics, but also in lexical / prosodic patterns

19

Page 33: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

References[1] Spontaneous Speech: How People Really Talk and Why Engineers Should Care Elizabeth Shriberg Proc. EUROSPEECH, Lisbon (Portugal), 2005

[2] Automatic Disfluency Identification in Conversational Speech Using Multiple Knowledge Sources Yang Liu, Elizabeth Shriberg, Andreas Stolcke Proc. EUROSPEECH, Geneva (Switzerland), 2003

[3] How to Find Trouble in Communication Anton Batliner et al. Speech Communication, 40, 2003

[4] Conversational Speech Synthesis and the Need for Some Laughter Nick Campbell Journal of LaTeX Class files, Vol. 1, November 2002

20

Page 34: The Challenge of Analyzing Conversational Speech

Advanced Signal Processing II Conversational Speech Johannes Luig

References[5] A Prosody-based Approach to End-of-Utterance Detection that does not require Speech Recognition Luciana Ferrer, Elizabeth Shriberg, Andreas Stolcke Proc. ICASSP, 2003

[6] Latent Variable Decomposition of Spectrograms for Single-Channel Speaker Separation Bhiksha Raj, Paris Smaragdis IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2005

21

Page 35: The Challenge of Analyzing Conversational Speech

Thank you.