
Context and Prosody in the Interpretation of Cue Phrases in Dialogue

Julia Hirschberg, Columbia University and KTH

11/22/07

Spoken Dialog with Humans and Machines

2

In collaboration with

Agustín Gravano, Stefan Benus, Héctor Chávez, Shira Mitchell, and Lauren Wilcox

With thanks to Gregory Ward and Elisa Sneed German

3

Managing Conversation

How do speakers indicate conversational structure in human/human dialogue?

How do they communicate varying levels of attention, agreement, acknowledgment?

What role does lexical choice play in these communicative acts? Phonetic realization? Prosodic variation? Prior context?

Can human/human behavior be modeled in Spoken Dialogue Systems?

4

Cue Phrases/Discourse Markers/Cue Words/ Discourse Particles/Clue Words

Linguistic expressions that can be employed to convey information about the discourse structure, or to make a semantic (literal?) contribution.

Examples: now, well, so, alright, and, okay, first, on the other hand, by the way, for example, …

5

Some Examples

that’s pretty much okay

Speaker 1: between the yellow mermaid and the whale

Speaker 2: okay

Speaker 1: and it is

okay we gonna be placing the blue moon

6

A Problem for Spoken Dialogue systems

How do speakers produce and hearers interpret such potentially ambiguous terms? How important is acoustic/prosodic information? Phonetic variation? Discourse context?

7

Research Goals

Learn which features best characterize the different functions of single affirmative cue words.

Determine how these can be identified automatically.

Important in Spoken Dialogue Systems: Understand user input. Produce output appropriately.

8

Overview

Previous research
The Columbia Games Corpus
  Collection paradigm
  Annotations
Perception Study of Okays
  Experimental design
  Analysis and results
Machine Learning Experiments on Okay
Future work: Entrainment and Cue Phrases

9

Previous Work

General studies
  Schiffrin ’82, ’87; Reichman ’85; Grosz & Sidner ’86
Cues to cue phrase disambiguation
  Hirschberg & Litman ’87, ’93; Hockey ’93; Litman ’94
Cues to Dialogue Act identification
  Jurafsky et al. ’98; Rosset & Lamel ’04
Contextual cues to the production of backchannels
  Ward & Tsukahara ’00; Sanjanhar & Ward ’06

10

The Columbia Games Corpus: Collection

12 spontaneous task-oriented dyadic conversations in Standard American English (9h 8m speech)

2 subjects playing a series of computer games, no eye contact (45m 39s mean session time)

2 sessions per subject, w/different partners

Several types of games, designed to vary the way discourse entities became old, or ‘given’ in the discourse to study variation in intonational realization of information status

11

Player 2 (Searcher)

Player 1 (Describer)

Cards Game #1

• Short monologues
• Vary frequency and order of occurrence of objects on the cards.

12

Cards Game #2

Player 2 (Searcher)


• Dialogue
• Vary frequency and order of occurrence of objects on the cards across speakers.

13

Objects Game

Follower must place the target object where it appears on the Describer’s screen solely via the description provided (4h 19m)

Describer: Follower:

14

The Columbia Games Corpus: Recording and Logging

Recorded on separate channels in soundproof booth, digitized and downsampled to 16 kHz

All user and system behaviors logged

15

The Columbia Games Corpus: Annotation

Orthographic transcription and alignment (~73k words).
Laughs, coughs, breaths, smacks, throat-clearings.
Self-repairs.
Intonation, using ToBI conventions.
Function (10 categories) of affirmative cue words (alright, mm-hm, okay, right, uh-huh, yeah, yes, …).
Question form and function.
Turn-taking behaviors.

18

Perception Study: Selection of Materials

okay

Speaker 1: but it's gonna be below the onion
Speaker 2: okay

Cue beginning discourse segment

Backchannel

Acknowledgment / Agreement

Speaker 1: okay alright I'll try it okay
Speaker 2: okay the owl is blinking

Speaker 1: yeah um there's like there's some space there's
Speaker 2: okay I think I got it

19


Perception Study: Experiment Design

54 instances of ‘okay’ (18 for each function).
2 tokens for each ‘okay’:
  Isolated condition: Only the word ‘okay’.
  Contextualized condition: 2 full speaker turns:
    The turn containing the target ‘okay’; and
    The previous turn by the other speaker.


20

Perception Study: Experiment Design

1/3 each: 3 labelers agreed, 2…, none
Two conditions:
  Part 1: 54 isolated tokens
  Part 2: 54 contextualized tokens
Subjects asked to classify each token of ‘okay’ as:
  Acknowledgment / Agreement, or
  Backchannel, or
  Cue beginning discourse segment.

21

Perception Study: Definitions Given to the Subjects

Acknowledgment/Agreement: The function of okay that indicates “I believe what you said” and/or “I agree with what you say”.

Backchannel: The function of okay in response to another speaker's utterance that indicates only “I’m still here” or “I hear you and please continue”.

Cue beginning discourse segment: The function of okay that marks a new segment of a discourse or a new topic. This use of okay could be replaced by now.

22

Perception Study: Subjects and Procedure

Subjects:
  20 paid subjects (10 female, 10 male).
  Ages between 20 and 60.
  Native speakers of English.
  No hearing problems.

GUI on a laboratory workstation with headphones.

23

Results: Inter-Subject Agreement

Kappa measure of agreement with respect to chance (Fleiss ’71)

                          Isolated Condition   Contextualized Condition
Overall                   .120                 .294
Ack / Agree vs. Other     .089                 .227
Backchannel vs. Other     .118                 .164
Cue beginning vs. Other   .157                 .497
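A minimal sketch of how Fleiss' kappa can be computed for a labeling task like this one. The function names (`fleiss_kappa`, `collapse`) and the toy ratings are illustrative assumptions, not the study's code or data; `collapse` mirrors the "X vs. Other" rows by merging every other label into a single class.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one list of labels per item,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_item = [
        (sum(cnt[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_item) / n_items

    # Chance agreement from the overall share of each category.
    p_cat = [
        sum(cnt[c] for cnt in counts) / (n_items * n_raters)
        for c in categories
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1 - p_exp)

def collapse(ratings, target):
    """Binarize labels into `target` vs. 'Other' (hypothetical helper)."""
    return [[r if r == target else "Other" for r in item] for item in ratings]

if __name__ == "__main__":
    # Toy data: 4 tokens of 'okay', 3 raters each (NOT the study's data).
    toy = [
        ["ack", "ack", "cue"],
        ["back", "back", "back"],
        ["cue", "cue", "cue"],
        ["ack", "back", "ack"],
    ]
    print(fleiss_kappa(toy, ["ack", "back", "cue"]))
    print(fleiss_kappa(collapse(toy, "cue"), ["cue", "Other"]))
```

Kappa is 1 for perfect agreement and 0 when agreement is at chance level, which is why the low isolated-condition values indicate that subjects rarely agreed without context.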

24

Results: Cues to Interpretation

Phonetic transcription of okay:

Isolated Condition

Strong correlation for realization of initial vowel

Backchannel

Ack/Agree, Cue Beginning

Contextualized Condition

No strong correlations found for phonetic variants.

25

Results: Cues to Interpretation

Ack / Agree
  Isolated Condition: Shorter /k/
  Contextualized Condition: Shorter latency between turns; shorter pause before okay
Backchannel
  Isolated Condition: Higher final pitch slope; longer 2nd syllable; lower intensity
  Contextualized Condition: Higher final pitch slope; more words by S2 before okay; fewer words by S1 after okay
Cue beginning
  Isolated Condition: Lower final pitch slope; lower overall pitch slope
  Contextualized Condition: Lower final pitch slope; longer latency between turns; more words by S1 after okay

S1 = Utterer of the target ‘okay’. S2 = The other speaker.

26

Results: Cues to Interpretation

Phrase-final intonation (ToBI) (both isolated and contextualized conditions):
H-H%   Backchannel
H-L%
L-H%   Ack/Agree, Backchannel
L-L%   Ack/Agree, Cue beginning

27

Perception Study: Conclusions

Agreement:
  Availability of context improves inter-subject agreement.
  Cue beginnings easier to disambiguate than the other two functions.
Cues to interpretation:
  Contextual features override word features.
  Exception: Final pitch slope of okay in both conditions.

28

Machine Learning Experiments: Okay

Can we identify the different functions of okay in our larger corpus reliably?

What features perform best?

How do these compare to those that predict human judgments?

29

Method

ML Algorithm
  JRip: Weka’s implementation of the propositional rule learner Ripper (Cohen ’95).
  We also tried J4.8, Weka’s implementation of the decision tree learner C4.5 (Quinlan ’93, ’96), with similar results.
10-fold cross validation in all experiments.
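A rough sketch of the 10-fold cross-validation protocol named on this slide. It does not use Weka or Ripper: a majority-class predictor stands in for the learner so the example stays self-contained, and the function names and toy labels are assumptions made for illustration.

```python
import random
from collections import Counter

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_error(labels, k=10, seed=0):
    """Error rate of a majority-class predictor under k-fold CV.

    Each fold is held out once; the 'model' is simply the most
    frequent label among the remaining training items.
    """
    errors = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [lab for i, lab in enumerate(labels) if i not in held]
        majority = Counter(train).most_common(1)[0][0]
        errors += sum(1 for i in held_out if labels[i] != majority)
    return errors / len(labels)

if __name__ == "__main__":
    # Toy label distribution (NOT the corpus counts).
    toy_labels = ["ack"] * 70 + ["cue"] * 30
    print(cross_validated_error(toy_labels))
```

Swapping the majority-class step for a real learner keeps the same loop: train on nine folds, test on the tenth, and pool the errors over all ten held-out folds.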

30

Units of Analysis

IPU (Inter-pausal unit): Maximal sequence of words delimited by pause > 50ms.

Conversational Turn: Maximal sequence of IPUs by the same speaker, with no contribution from the other speaker.
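The two units above can be sketched from time-aligned words as follows. The `Word` record and the function names are hypothetical, and ordering IPUs by start time is a simplifying assumption that ignores overlapping speech.

```python
from dataclasses import dataclass

@dataclass
class Word:
    speaker: str
    text: str
    start: float  # seconds
    end: float    # seconds

def split_into_ipus(words, max_pause=0.050):
    """Group ONE speaker's time-ordered words into IPUs.

    A new IPU starts whenever the silence before a word exceeds
    max_pause (50 ms, the threshold given on the slide).
    """
    ipus = []
    for w in words:
        if ipus and w.start - ipus[-1][-1].end <= max_pause:
            ipus[-1].append(w)
        else:
            ipus.append([w])
    return ipus

def group_into_turns(ipus):
    """Merge consecutive IPUs by the same speaker into turns.

    Assumption: IPUs from both speakers are ordered by start time,
    so a turn ends as soon as the other speaker contributes an IPU.
    """
    turns = []
    for ipu in sorted(ipus, key=lambda u: u[0].start):
        if turns and turns[-1][0][0].speaker == ipu[0].speaker:
            turns[-1].append(ipu)
        else:
            turns.append([ipu])
    return turns

if __name__ == "__main__":
    a = [Word("A", "okay", 0.00, 0.30), Word("A", "so", 0.32, 0.60),
         Word("A", "the", 0.80, 1.00), Word("A", "owl", 1.50, 1.70)]
    b = [Word("B", "mm-hm", 1.10, 1.30)]
    turns = group_into_turns(split_into_ipus(a) + split_into_ipus(b))
    for turn in turns:
        print(turn[0][0].speaker, [[w.text for w in ipu] for ipu in turn])
```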

31

Experimental features

Text-based features (from transcriptions)
  Word ident, POS tags (auto); position of word in IPU / turn
  IPU, turn length in words; prev turn same spkr?
Timing features (from time alignment)
  Word / IPU / turn duration; amount of spkr overlap
  Time to word beg/end in IPU, turn
Acoustic features
  {min, mean, max, stdev} x {pitch, intensity}
  Slope of pitch, stylized pitch, and intensity, over the whole word, and over its last 100, 200, 300ms.
  Acoustic features from last IPU of prior speaker’s turn.
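A small sketch of the acoustic summary features listed above, computed from a pitch (or intensity) track sampled over one word. The slides do not say how slope was measured; a least-squares line fit is assumed here, and the function names and the synthetic contour are illustrative only.

```python
from statistics import mean, stdev

def summary_stats(values):
    """The {min, mean, max, stdev} summary for one track (needs >= 2 samples)."""
    return {"min": min(values), "mean": mean(values),
            "max": max(values), "stdev": stdev(values)}

def slope(times_ms, values):
    """Least-squares slope of values against time (units per ms).

    Assumption: a simple linear fit stands in for whatever slope
    measure the original feature extraction used.
    """
    t_bar, v_bar = mean(times_ms), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times_ms, values))
    den = sum((t - t_bar) ** 2 for t in times_ms)
    return num / den

def final_slope(times_ms, values, window_ms):
    """Slope over the last `window_ms` of the word (e.g. 100, 200, 300)."""
    cutoff = times_ms[-1] - window_ms
    kept = [(t, v) for t, v in zip(times_ms, values) if t >= cutoff]
    return slope([t for t, _ in kept], [v for _, v in kept])

if __name__ == "__main__":
    # Synthetic pitch track: flat for 200 ms, then rising (NOT real data).
    times = list(range(0, 401, 50))                      # 0..400 ms
    pitch = [200.0 if t <= 200 else 200.0 + 0.5 * (t - 200) for t in times]
    print(summary_stats(pitch))
    print(slope(times, pitch))
    for window in (100, 200, 300):
        print(window, final_slope(times, pitch, window))
```

Computing the slope over several final windows separates a contour that rises only at the very end from one that rises throughout the word.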

32

Results: Classification of individual words

Classification of each individual word into its most common functions.
  alright: Ack/Agree, Cue Begin, Other
  mm-hm: Ack/Agree, Backchannel
  okay: Ack/Agree, Backchannel, Cue Begin, Ack+CueBegin, Ack+CueEnd, Other
  right: Ack/Agree, Check, Literal Modifier
  yeah: Ack/Agree, Backchannel

34

Results: Classification of ‘okay’

                                         F-Measure
Feature Set           Error Rate (%)     Ack/Agree   Backchannel   Cue Begin   Ack/Agree + Cue Begin   Ack/Agree + Cue End
Majority Label                           1137        121           548         68                      232
Text-based            31.7               .76         .16           .77         .09                     .33
Acoustic              40.2               .69         .24           .64         .03                     .25
Text-based + Timing   25.6               .79         .31           .82         .18                     .67
Full set              25.5               .80         .46           .83         .21                     .66
Baseline (1)          48.3               .68         .00           .00         .00                     .00
Human labelers (2)    14.0               .89         .78           .94         .56                     .73

(1) Majority class baseline: ACK/AGREE.
(2) Calculated wrt each labeler’s agreement with the majority labels.
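As a worked check on the baseline row, the sketch below derives the Ack/Agree F-measure from the baseline error rate alone. It assumes the usual balanced F-measure (harmonic mean of precision and recall); the function name is illustrative. A baseline that labels every token Ack/Agree has recall 1 for that class, and its precision equals its accuracy.

```python
def f_measure(precision, recall):
    """Balanced F-measure; defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

baseline_error = 0.483           # baseline error rate from the table (48.3%)
precision = 1 - baseline_error   # every token is predicted Ack/Agree
recall = 1.0                     # so no true Ack/Agree token is missed

print(round(f_measure(precision, recall), 2))  # 0.68, matching the table
print(f_measure(0.0, 0.0))                     # classes never predicted: 0.0
```

The other four classes are never predicted by the baseline, which is why their F-measure is reported as .00.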

35

Conclusions: ML Experiments

Context and timing features
  Like perception in context results: timing
  Pause after okay, not before
  # of succeeding words
Acoustic features impoverished
  No phonetic features
  No pitch slope
  But ToBI labels (where available) didn’t help

36

Future Work

Experiments with full ToBI labeling
Other features
Lexical, Acoustic-Prosodic, and Discourse Entrainment and Dis-Entrainment
  Positive correlations for affirmative cue words
  Affirmative cue word entrainment and game scores
  Affirmative cue word entrainment and overlaps and interruptions in turn-taking

38

Other Work

Benus et al., 2007: “The prosody of backchannels in American English”, ICPhS 2007, Saarbrücken, Germany, August 2007.

Gravano et al., 2007: “Classification of discourse functions of affirmative words in spoken dialogue”, Interspeech 2007, Antwerp, Belgium, August 2007.

39

Importance for Spoken Dialogue Systems

Convey ambiguous terms with the intended meaning

Interpret the user’s input correctly

40

Experiment Design

Goal: Study the relation between the down-stepped contour and
  Information status
  Syntactic position
  Discourse position
Spontaneous speech
Both monologue and dialogue

41

Experiment Design

Three computer games. Two players, each on a different computer.

They collaborate to perform a common task. Totally unrestricted speech.

42

Objects Game

Player 2 (Searcher)


• Dialogue
• Vary target and surrounding objects (subject and object position).

43

Games Session

Repeat 3 times:
  Cards Game #1
  Cards Game #2
Short break (optional)
Repeat 3 times:
  Objects Game
Each subject participated in 2 sessions.
12 sessions

44

Subjects

Postings:
  Columbia’s webpage for temporary job ads.
  Craigslist (http://www.craigslist.org). Category: Gigs Event gigs
Problem: People are unreliable
  ~50% did not show up, or cancelled with short notice.

45

Subjects

Possible solutions:
  Give precise instructions to e-mail ALL required info: Name, native speaker?, hearing impairments?, etc.
  Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
  Increase the pay after each session. Example: $5, $10, $15 instead of $10, $10, $10.

46

Recording

Sound-proof booth
  2 subjects + 1 or 2 confederates.
  Head-mounted mics.
  Digital Audio Tape (DAT): one channel per speaker.
Wav files
  One mono file per speaker.
  Sample rate: 48000
  Downsampled to 16000 (but kept original files!)
  ~20 hours of speech
  2.8 GB (16k)

47

Logs

Log everything the subjects do to a text file. Example:

17:03:55:234 BEGIN_EXECUTION

17:04:04:868 NEXT_TURN

17:04:31:837 RESULTS 97 points awarded.

17:04:38:426 NEXT_TURN

17:05:03:873 RESULTS 92 points awarded.

...

Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
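One way such a log could be split into tasks is sketched below. The timestamp layout (HH:MM:SS:mmm) is read off the example lines; treating NEXT_TURN as the start of a task and RESULTS as its end is an assumption, and the function names are illustrative.

```python
import re
from datetime import timedelta

# HH:MM:SS:mmm, an event name, then optional free text.
LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+(\S+)\s*(.*)$")

def parse_log(text):
    """Return (time, event, detail) tuples; non-matching lines are skipped."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # blank lines, "...", or anything else
        h, mi, s, ms, name, detail = m.groups()
        t = timedelta(hours=int(h), minutes=int(mi),
                      seconds=int(s), milliseconds=int(ms))
        events.append((t, name, detail))
    return events

def task_durations(events):
    """Seconds from each NEXT_TURN to the next RESULTS event (assumed pairing)."""
    durations, started = [], None
    for t, name, _ in events:
        if name == "NEXT_TURN":
            started = t
        elif name == "RESULTS" and started is not None:
            durations.append((t - started).total_seconds())
            started = None
    return durations

if __name__ == "__main__":
    sample = """\
17:03:55:234 BEGIN_EXECUTION
17:04:04:868 NEXT_TURN
17:04:31:837 RESULTS 97 points awarded.
17:04:38:426 NEXT_TURN
17:05:03:873 RESULTS 92 points awarded.
"""
    print(task_durations(parse_log(sample)))
```

The resulting time spans could then be used to cut the session audio into one segment per task.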
