“k hypotheses + other” belief updating in spoken dialog systems

Dialogs on Dialogs Talk, March 2006

Dan Bohus
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus

2

problem

spoken language interfaces lack robustness when faced with understanding errors

errors stem mostly from speech recognition
typical word error rates: 20-30%
significant negative impact on interactions

3

guarding against understanding errors

use confidence scores
machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]

engage in confirmation actions

explicit confirmation
did you say you wanted to fly to Seoul?
yes → trust hypothesis
no → delete hypothesis
“other” → non-understanding

implicit confirmation
traveling to Seoul … what day did you need to travel?
rely on new values overwriting old values

4

today’s talk …

construct accurate beliefs by integrating information over multiple turns in a conversation

S: Where would you like to go?
U: Huntsville
[SEOUL / 0.65]

destination = {seoul/0.65}

S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham
[THE TRAVELING TO BERLIN P_M / 0.60]

destination = {?}

5

belief updating: problem statement



destination = {?}

[THE TRAVELING TO BERLIN P_M / 0.60]

given
an initial belief B_initial(C) over concept C
a system action SA
a user response R

construct an updated belief
B_updated(C) ← f(B_initial(C), SA, R)

6

outline

proposed approach

data

experiments and results

effect on dialog performance

conclusion

7

belief updating: problem statement



destination = {?}

[THE TRAVELING TO BERLIN P_M / 0.60]

given
an initial belief B_initial(C) over concept C
a system action SA(C)
a user response R

construct an updated belief
B_updated(C) ← f(B_initial(C), SA(C), R)


8

belief representation

B_updated(C) ← f(B_initial(C), SA(C), R)

most accurate representation: a probability distribution over the set of possible values

however, the system will “hear” only a small number of conflicting values for a concept within a dialog session

in our data
max = 3 (conflicting values heard)
more than 1 value heard in only 6.9% of cases


9

compressed belief representation

k hypotheses + other

at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m + n = k)
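The selection step above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_hypotheses`, the dictionary layout, and the choice to rank new values by recognizer confidence are hypothetical and do not come from the talk. The sketch only decides which hypotheses are tracked; the confidences of the new hypotheses are left to the update model.

```python
from typing import Dict, List, Tuple

OTHER = "other"  # catch-all bucket for every value not tracked explicitly


def select_hypotheses(
    initial: Dict[str, float],
    heard: List[Tuple[str, float]],
    m: int,
    n: int,
) -> Tuple[Dict[str, float], List[str]]:
    """Pick the k = m + n hypotheses to track for one concept (hypothetical helper)."""
    # Rank the initial hypotheses (excluding the catch-all) by confidence.
    ranked = sorted(
        ((h, c) for h, c in initial.items() if h != OTHER),
        key=lambda hc: hc[1],
        reverse=True,
    )
    kept = dict(ranked[:m])
    # The probability mass of everything not kept is folded into "other".
    kept[OTHER] = max(0.0, 1.0 - sum(kept.values()))
    # Take up to n values from the input that are not already tracked,
    # best recognizer score first (an assumption of this sketch).
    new = [
        h
        for h, _ in sorted(heard, key=lambda hc: hc[1], reverse=True)
        if h not in kept
    ][:n]
    return kept, new


# Example from the talk: Seoul is tracked, Berlin is newly heard.
kept, new = select_hypotheses(
    {"seoul": 0.65, OTHER: 0.35}, [("berlin", 0.60)], m=1, n=1
)
```

With m = 1 and n = 1 this yields the k = 2 setting: `seoul` is retained, `berlin` becomes the second tracked hypothesis, and everything else stays in `other`.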



10

B(C) modeled as a multinomial variable {h1, h2, …, hk, other}

B(C) = <c_h1, c_h2, …, c_hk, c_other>, where c_h1 + c_h2 + … + c_hk + c_other = 1

belief updating can be cast as a multinomial regression problem:

B_updated(C) ← B_initial(C) + SA(C) + R
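To make the multinomial view concrete, here is a minimal sketch of how a multinomial (softmax) regression turns features into a belief that sums to 1. The function name, the feature layout, and the weights are all made up for illustration; they are not fitted values or code from the talk. Treating "other" as the reference class with score 0 is one common parameterization, assumed here.

```python
import math
from typing import Dict, List


def multinomial_belief(
    features: List[float],
    weights: Dict[str, List[float]],
) -> Dict[str, float]:
    """Map a feature vector to a distribution over {h1, ..., hk, other}."""
    # Linear score per outcome; "other" is the reference class with score 0.
    scores = {"other": 0.0}
    for outcome, w in weights.items():
        scores[outcome] = sum(wi * xi for wi, xi in zip(w, features))
    # Softmax, shifted by the largest score for numerical stability.
    top = max(scores.values())
    exps = {o: math.exp(s - top) for o, s in scores.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}


# Hypothetical features: [bias, initial confidence of h1, confidence of new value h2].
belief = multinomial_belief(
    [1.0, 0.65, 0.60],
    {"h1": [0.2, 1.0, -1.5], "h2": [-0.3, -0.5, 2.0]},  # illustrative weights only
)
```

Whatever the weights, the output has one confidence per tracked hypothesis plus `other`, and the confidences add up to 1, which is the constraint stated above.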



11

system action

B_updated(C) ← f(B_initial(C), SA(C), R)

request
S: For when do you want the room?
U: Friday
[FRIDAY / 0.65]

explicit confirmation
S: Did you say you wanted a room for Friday?
U: Yes
[GUEST / 0.30]

implicit confirmation
S: a room for Friday … starting at what time?
U: starting at ten a.m.
[STARTING AT TEN A_M / 0.86]

unplanned implicit confirmation
S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
U: not Friday, Thursday
[FRIDAY THURSDAY / 0.25]

no action / unexpected update
S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
U: guest user
[THIS TUESDAY / 0.55]


12

user response

B_updated(C) ← f(B_initial(C), SA(C), R)

acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.

lexical: number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)

grammatical: number of slots (new and repeated), parse fragmentation, parse gaps, etc.

dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity

priors: priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)

confusability: empirically derived confusability scores


13

approach

problem
<uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)

approach: multinomial generalized linear model
regression model, multinomial dependent variable
sample efficient

stepwise approach
feature selection
BIC to control over-fitting

one model for each system action
<uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)

B_updated(C) ← f(B_initial(C), SA(C), R)
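The stepwise selection with BIC can be sketched generically. This is a hedged illustration of forward stepwise selection, not the procedure actually used in the talk: the names `bic` and `forward_stepwise` are hypothetical, and `fit` stands in for whatever routine fits the multinomial model on a feature subset and returns its log-likelihood and parameter count.

```python
import math
from typing import Callable, FrozenSet, Iterable, Tuple


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian information criterion: lower is better."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def forward_stepwise(
    candidates: Iterable[str],
    fit: Callable[[FrozenSet[str]], Tuple[float, int]],
    num_samples: int,
) -> FrozenSet[str]:
    """Greedily add the feature that lowers BIC most; stop when none does."""
    selected: FrozenSet[str] = frozenset()
    best = bic(*fit(selected), num_samples)
    remaining = set(candidates)
    while remaining:
        # Score every one-feature extension of the current model.
        scored = {f: bic(*fit(selected | {f}), num_samples) for f in remaining}
        feature = min(scored, key=scored.get)
        if scored[feature] >= best:
            break  # the complexity penalty outweighs the likelihood gain
        selected, best = selected | {feature}, scored[feature]
        remaining.discard(feature)
    return selected
```

Under the one-model-per-system-action design, a loop like this would be run separately on the turns following each system action (request, explicit confirmation, implicit confirmation, and so on), so each action can end up with its own feature set.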


14

outline

proposed approach

data

experiments and results

effect on dialog performance

conclusion


15

data

collected with RoomLine
a phone-based mixed-initiative spoken dialog system
conference room reservation

explicit and implicit confirmations

simple heuristic rules for belief updating
explicit confirm: yes / no
implicit confirm: new values overwrite old ones
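The heuristic rules above amount to something like the following sketch. The function name, the action labels, and the argument names are hypothetical. Keeping the belief unchanged when the answer to an explicit confirmation is neither yes nor no is an assumption of this sketch; the talk only labels that case a non-understanding.

```python
from typing import Optional


def heuristic_update(
    current: Optional[str],
    action: str,
    response_value: Optional[str] = None,
    answer: Optional[str] = None,
) -> Optional[str]:
    """Return the concept value the system believes after one turn."""
    if action == "explicit_confirm":
        if answer == "yes":
            return current  # trust the hypothesis
        if answer == "no":
            return None  # delete the hypothesis
        return current  # non-understanding: left unchanged (assumption)
    if action == "implicit_confirm":
        # A newly heard value overwrites the old one.
        return response_value if response_value is not None else current
    return current


# The misrecognized correction from the earlier example: Berlin overwrites Seoul,
# even though the user actually said Birmingham.
destination = heuristic_update("seoul", "implicit_confirm", response_value="berlin")
```

The example shows the weakness of these rules: they take a single recognition result at face value instead of weighing it against the previous belief.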


16

corpus

user study
46 participants (naïve users)
10 scenario-based interactions each
compensated per task success

corpus
449 sessions, 8848 user turns
orthographically transcribed
manually annotated: misunderstandings, corrections, correct concept values


17

outline

proposed approach

data

experiments and results

effect on dialog performance

conclusion


18

baselines

initial baseline: accuracy of system beliefs before the update

heuristic baseline: accuracy of the heuristic update rule used by the system

oracle baseline: accuracy if we knew exactly when the user corrects


19

k=2 hypotheses + other

informative features
priors and confusability
initial confidence score
concept identity
barge-in
expectation match
repeated grammar slots


20

outline

proposed approach

data

experiments and results

effect on dialog performance

conclusion


21

a question remains …

… does this really matter?

what is the effect on global dialog performance?


22

let’s run an experiment

guinea pigs from Speech Lab for exp: $0

getting change from guys in the lab: $2/$3/$5

real subjects for the experiment: $25

picture with advisor of the VERY last exp at CMU: priceless!!!!

[courtesy of Mohit Kumar]

23

a new user study …

implemented models in RavenClaw, performed a new user study
40 participants, first-time users
10 scenario-driven interactions each

non-native speakers of North-American English
improvements more likely at higher WER
supported by empirical evidence

between-subjects; 2 gender-balanced groups
control: RoomLine using heuristic update rules
treatment: RoomLine using runtime models


24

effect on task success


task success
control: 73.6%
treatment: 81.3%

even though

average user WER
control: 21.9%
treatment: 24.2%

25

effect on task success … a closer look

Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition

p=0.001

[figure: probability of task success vs. word error rate, treatment and control curves]

at 30% WER: control 64%, treatment 78%
at 16% WER: control 78%
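The percentages on this slide can be reproduced from the regression formula. The sketch below assumes the formula gives the log-odds of task success (a logistic regression), that WER is entered in percentage points, and that Condition is 1 for treatment and 0 for control. The slide does not state these conventions, but they match the numbers shown.

```python
import math


def p_task_success(wer_percent: float, treatment: bool) -> float:
    """Probability of task success under the assumed logistic model."""
    logit = 2.09 - 0.05 * wer_percent + 0.69 * (1.0 if treatment else 0.0)
    return 1.0 / (1.0 + math.exp(-logit))


print(round(p_task_success(30, treatment=False), 2))  # 0.64
print(round(p_task_success(30, treatment=True), 2))   # 0.78
print(round(p_task_success(16, treatment=False), 2))  # 0.78
```

Read this way, the treatment system at 30% WER succeeds about as often as the control system at 16% WER.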

26

improvements at different WER

[figure: absolute improvement in task success vs. word-error-rate]
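The shape of this curve follows from the task success regression: the absolute improvement is the gap between the treatment and control success probabilities at a given WER. The sketch below recomputes that gap, assuming the regression Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition is a logistic model on log-odds with WER in percentage points; the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def improvement(wer_percent: float) -> float:
    """Treatment minus control probability of task success at a given WER."""
    control_logit = 2.09 - 0.05 * wer_percent
    return sigmoid(control_logit + 0.69) - sigmoid(control_logit)


# Scan integer WER values from 0 to 100 for the largest gap.
peak_wer = max(range(0, 101), key=improvement)
print(peak_wer)                          # 49
print(round(improvement(peak_wer), 2))   # 0.17
```

Under these assumptions the gap is small at very low and very high WER and largest near 49% WER, at roughly 0.17.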

27

effect on task duration (for successful tasks)

ANOVA on task duration for successful tasks
Duration ← -0.21 + 0.013∙WER - 0.106∙Condition

significant improvement, equivalent to 7.9% absolute reduction in WER
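The "equivalent WER reduction" can be read off the duration regression as the ratio of the Condition coefficient to the WER coefficient. The sketch below does that arithmetic with the rounded coefficients printed on the slide; the function name is hypothetical, and WER is assumed to be in percentage points.

```python
def equivalent_wer_reduction(wer_coef: float, condition_coef: float) -> float:
    """WER drop (percentage points) with the same effect as switching condition."""
    return -condition_coef / wer_coef


print(round(equivalent_wer_reduction(0.013, -0.106), 1))  # 8.2
```

The rounded coefficients give about 8.2 points rather than the reported 7.9. The difference plausibly comes from rounding of the printed coefficients, though the slide does not say.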


28

outline

proposed approach

data

experiments and results

effect on dialog performance

conclusion


29

summary

data-driven approach for constructing accurate system beliefs

integrate information across multiple turns

bridge together detection of misunderstandings and corrections

significantly outperforms current heuristics

significantly improves effectiveness and efficiency

30

other advantages

sample efficient
performs a local one-turn optimization
good local performance leads to good global performance

scalable
works independently on concepts
29 concepts, varying cardinalities

portable
decoupled from dialog task specification
doesn’t make strong assumptions about dialog management technology

31

thank you! questions …

32

user study

10 scenarios, fixed order
presented graphically (explained during briefing)

participants compensated per task success
