Post on 21-Dec-2015
“k hypotheses + other” belief updating in spoken dialog systems
Dialogs on Dialogs Talk, March 2006
Dan Bohus
Computer Science Department, Carnegie Mellon University
Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus
[email protected]
2
problem
spoken language interfaces lack robustness when faced with understanding errors
  errors stem mostly from speech recognition
  typical word error rates: 20-30%
  significant negative impact on interactions
3
guarding against understanding errors
use confidence scores
  machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]
engage in confirmation actions
  explicit confirmation
    did you say you wanted to fly to Seoul?
    yes → trust hypothesis
    no → delete hypothesis
    “other” → non-understanding
  implicit confirmation
    traveling to Seoul … what day did you need to travel?
    rely on new values overwriting old values
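The heuristic confirmation rules listed on this slide can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and keeping the belief unchanged on an "other" response (a non-understanding) is an assumption, since the slide does not say what happens to the belief in that case.

```python
from typing import Optional


def explicit_confirm_update(hypothesis: Optional[str], response: str) -> Optional[str]:
    """Heuristic update after an explicit confirmation ("did you say X?")."""
    if response == "yes":
        return hypothesis      # yes -> trust hypothesis
    if response == "no":
        return None            # no -> delete hypothesis
    # "other" -> non-understanding; assumption: the belief is left as it was
    return hypothesis


def implicit_confirm_update(hypothesis: Optional[str], new_value: Optional[str]) -> Optional[str]:
    """Heuristic update after an implicit confirmation: new values overwrite old ones."""
    return new_value if new_value is not None else hypothesis


# A misrecognized correction simply replaces one wrong value with another.
destination = implicit_confirm_update("seoul", "berlin")
```

Under these rules the belief holds a single value at a time, so evidence from earlier turns is discarded whenever a new value is heard.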
related work : data : user response analysis : proposed approach: experiments and results : conclusion
4
construct accurate beliefs by integrating information over multiple turns in a conversation
today’s talk …
S: Where would you like to go?
U: Huntsville
   [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham
   [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
5
belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
   [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
given
  an initial belief B_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  B_updated(C) ← f(B_initial(C), SA, R)
6
outline
proposed approach
data
experiments and results
effect on dialog performance
conclusion
proposed approach: data: experiments and results : effect on dialog performance : conclusion
7
belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
   [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
given
  an initial belief B_initial(C) over concept C
  a system action SA(C)
  a user response R
construct an updated belief
  B_updated(C) ← f(B_initial(C), SA(C), R)
8
belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
most accurate representation
  probability distribution over the set of possible values
however
  system will “hear” only a small number of conflicting values for a concept within a dialog session
  in our data
    max = 3 (conflicting values heard)
    only in 6.9% of cases, more than 1 value heard
9
belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
compressed belief representation
  k hypotheses + other
  at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m + n = k)
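A minimal sketch of how the k hypotheses + other bookkeeping might look. The dictionary representation, the function names, and the tie-breaking by confidence are assumptions made for illustration; the slide only states that the top m initial hypotheses are kept and n new ones are added.

```python
from typing import Dict, List


def select_hypotheses(initial: Dict[str, float], heard: Dict[str, float],
                      m: int, n: int) -> List[str]:
    """Pick the k = m + n hypothesis slots for the next belief."""
    # Keep the m highest-confidence hypotheses from the initial belief.
    top_initial = sorted(initial, key=initial.get, reverse=True)[:m]
    # Add up to n values heard in the current input that are not already kept.
    new_values = [v for v in sorted(heard, key=heard.get, reverse=True)
                  if v not in top_initial]
    return top_initial + new_values[:n]


def with_other(belief: Dict[str, float], kept: List[str]) -> Dict[str, float]:
    """Project a belief onto the kept hypotheses; the remaining mass goes to 'other'."""
    compressed = {value: belief.get(value, 0.0) for value in kept}
    compressed["other"] = max(0.0, 1.0 - sum(compressed.values()))
    return compressed


# Example with k = 2 (m = 1, n = 1), using the values from the Seoul dialog.
slots = select_hypotheses({"seoul": 0.65}, {"berlin": 0.60}, m=1, n=1)
initial_view = with_other({"seoul": 0.65}, slots)
```

The sketch only selects which values occupy the slots and expresses the initial belief over them. The updated confidences themselves come from the learned update function, not from this code.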
10
belief representation
B_updated(C) ← f(B_initial(C), SA(C), R)
B(C) modeled as a multinomial variable {h_1, h_2, …, h_k, other}
  B(C) = <c_h1, c_h2, …, c_hk, c_other>, where c_h1 + c_h2 + … + c_hk + c_other = 1
belief updating can be cast as a multinomial regression problem:
  B_updated(C) ← B_initial(C) + SA(C) + R
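One common way to write such a multinomial regression is the multinomial logit (softmax) form below. The feature vector x (built from the initial belief, the system action, and the user response) and the weight vectors w_j are notation introduced here for illustration, and the logit link is an assumption; the slides only say that a multinomial generalized linear model is used.

```latex
\[
  c^{\text{updated}}_{j}
  \;=\;
  P\bigl(C = j \mid \mathbf{x}\bigr)
  \;=\;
  \frac{\exp\!\bigl(\mathbf{w}_{j}^{\top}\mathbf{x}\bigr)}
       {\sum_{j' \in \{h_1,\dots,h_k,\text{other}\}} \exp\!\bigl(\mathbf{w}_{j'}^{\top}\mathbf{x}\bigr)},
  \qquad
  j \in \{h_1,\dots,h_k,\text{other}\}
\]
```

By construction the updated confidences are non-negative and sum to 1, matching the constraint on B(C).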
11
system action
B_updated(C) ← f(B_initial(C), SA(C), R)
request
  S: For when do you want the room?
  U: Friday
     [FRIDAY / 0.65]
explicit confirmation
  S: Did you say you wanted a room for Friday?
  U: Yes
     [GUEST / 0.30]
implicit confirmation
  S: a room for Friday … starting at what time?
  U: starting at ten a.m.
     [STARTING AT TEN A_M / 0.86]
unplanned implicit confirmation
  S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
  U: not Friday, Thursday
     [FRIDAY THURSDAY / 0.25]
no action / unexpected update
  S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
  U: guest user
     [THIS TUESDAY / 0.55]
12
user response
B_updated(C) ← f(B_initial(C), SA(C), R)
acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
lexical: number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
grammatical: number of slots (new and repeated), parse fragmentation, parse gaps, etc.
dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
priors: priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
confusability: empirically derived confusability scores
13
approach
B_updated(C) ← f(B_initial(C), SA(C), R)
problem
  <uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)
approach: multinomial generalized linear model
  regression model, multinomial independent variable
  sample efficient
stepwise approach
  feature selection
  BIC to control over-fitting
one model for each system action
  <uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)
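The stepwise, BIC-controlled feature selection with one model per system action could be sketched as below. This is a sketch under stated assumptions: forward selection is assumed (the slides say only "stepwise"), `fit_log_likelihood` is a hypothetical callback that would fit the multinomial model on a feature subset and return its maximized log-likelihood, and the toy likelihood table exists only to make the example runnable.

```python
import math
from typing import Callable, Dict, List, Sequence


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian information criterion; lower is better."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def forward_stepwise(candidates: Sequence[str],
                     fit_log_likelihood: Callable[[List[str]], float],
                     num_samples: int,
                     params_per_feature: int = 1) -> List[str]:
    """Greedily add the feature that lowers BIC most; stop when none does."""
    def score(features: List[str]) -> float:
        # +1 accounts for the intercept term(s).
        num_params = params_per_feature * (len(features) + 1)
        return bic(fit_log_likelihood(features), num_params, num_samples)

    selected: List[str] = []
    best = score(selected)
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        trial = {f: score(selected + [f]) for f in remaining}
        best_feature = min(trial, key=trial.get)
        if trial[best_feature] >= best:
            break  # the BIC penalty outweighs the likelihood gain
        selected.append(best_feature)
        best = trial[best_feature]
    return selected


# Toy stand-in for model fitting (hypothetical numbers, not from the talk).
TOY_GAINS = {"initial_confidence": 50.0, "barge_in": 10.0, "turn_number": 1.0}

def toy_log_likelihood(features: List[str]) -> float:
    return -500.0 + sum(TOY_GAINS[f] for f in features)

# One feature set (and hence one model) per system action.
SYSTEM_ACTIONS = ("request", "explicit_confirmation", "implicit_confirmation",
                  "unplanned_implicit_confirmation", "no_action")
features_by_action: Dict[str, List[str]] = {
    action: forward_stepwise(list(TOY_GAINS), toy_log_likelihood, num_samples=1000)
    for action in SYSTEM_ACTIONS
}
```

In a real setting each system action would have its own training data and its own fitting callback; the toy callback is shared here only to keep the example short.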
15
data
collected with RoomLine
  a phone-based mixed-initiative spoken dialog system
  conference room reservation
explicit and implicit confirmations
simple heuristic rules for belief updating
  explicit confirm: yes / no
  implicit confirm: new values overwrite old ones
16
corpus
user study
  46 participants (naïve users)
  10 scenario-based interactions each
  compensated per task success
corpus
  449 sessions, 8848 user turns
  orthographically transcribed
  manually annotated
    misunderstandings
    corrections
    correct concept values
18
baselines
initial baseline
  accuracy of system beliefs before the update
heuristic baseline
  accuracy of heuristic update rule used by the system
oracle baseline
  accuracy if we knew exactly when the user corrects
19
k=2 hypotheses + other
informative features
  priors and confusability
  initial confidence score
  concept identity
  barge-in
  expectation match
  repeated grammar slots
21
a question remains …
… does this really matter?
what is the effect on global dialog performance?
22
let’s run an experiment
guinea pigs from Speech Lab for exp: $0
getting change from guys in the lab: $2/$3/$5
real subjects for the experiment: $25
picture with advisor of the VERY last exp at CMU: priceless!!!!
[courtesy of Mohit Kumar]
23
a new user study …
implemented models in RavenClaw, performed a new user study
  40 participants, first-time users
  10 scenario-driven interactions each
non-native speakers of North-American English
  improvements more likely at higher WER
  supported by empirical evidence
between-subjects; 2 gender-balanced groups
  control: RoomLine using heuristic update rules
  treatment: RoomLine using runtime models
24
effect on task success
task success: control 73.6%, treatment 81.3%
even though average user WER: control 21.9%, treatment 24.2%
25
effect on task success … a closer look
Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition
p=0.001
[plot: probability of task success vs. word error rate, treatment and control curves]
at 30% WER: control 64%, treatment 78%
at 16% WER: control 78%
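The figures on this slide can be reproduced from the fitted model Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition. The sketch below assumes a logistic link, WER expressed in percentage points, and Condition coded as 1 for treatment and 0 for control; these codings are not stated on the slide, but they recover the 64% and 78% values shown. The function name is hypothetical.

```python
import math


def task_success_probability(wer_percent: float, treatment: bool) -> float:
    """Probability of task success under the fitted logistic model."""
    condition = 1.0 if treatment else 0.0
    logit = 2.09 - 0.05 * wer_percent + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-logit))


control_at_30 = task_success_probability(30, treatment=False)    # about 0.64
treatment_at_30 = task_success_probability(30, treatment=True)   # about 0.78
control_at_16 = task_success_probability(16, treatment=False)    # about 0.78
```

Read this way, the treatment condition at 30% WER performs about as well as the control condition at 16% WER, since 0.69 / 0.05 ≈ 14 percentage points of WER.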
26
improvements at different WER
[plot: absolute improvement in task success vs. word-error-rate]
27
effect on task duration (for successful tasks)
ANOVA on task duration for successful tasks
  Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
significant improvement, equivalent to 7.9% absolute reduction in WER
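The WER equivalence can be read off the regression by asking what change in WER has the same effect on duration as switching Condition from 0 to 1. The worked step below uses only the rounded coefficients shown on the slide; the small gap to the reported 7.9% is presumably due to that rounding, which is an assumption.

```latex
\[
  0.013 \cdot \Delta\mathrm{WER} = 0.106
  \quad\Longrightarrow\quad
  \Delta\mathrm{WER} = \frac{0.106}{0.013} \approx 8.2 \ \text{percentage points}
\]
```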
29
summary
data-driven approach for constructing accurate system beliefs
  integrate information across multiple turns
  bridge together detection of misunderstandings and corrections
significantly outperforms current heuristics
significantly improves effectiveness and efficiency
30
other advantages
sample efficient
  performs a local one-turn optimization
  good local performance leads to good global performance
scalable
  works independently on concepts
  29 concepts, varying cardinalities
portable
  decoupled from dialog task specification
  doesn’t make strong assumptions about dialog management technology
31
thank you! questions …
32
user study
10 scenarios, fixed order
presented graphically (explained during briefing)
participants compensated per task success