Belief Updating in Spoken Dialog Systems
Dialogs on Dialogs Reading Group, June 2005
Dan Bohus, Carnegie Mellon University, January 2004
Misunderstandings
Misunderstandings are an important problem in spoken dialog systems: the system obtains an incorrect semantic interpretation of the user's utterance
They occur in 15-40% of turns
Significant negative impact on overall success rate
Confidence annotation
Use confidence scores to guard against potential misunderstandings
Traditionally: from the speech recognition engine [Chase, Bansal, Cox, Kemp, etc.]
  Focuses on WER, not tuned to the task at hand
More recently: system-specific semantic confidence scores [Carpenter, Walker, San-Segundo, etc.]
  Integrate knowledge from different levels in the system: speech recognition, language understanding, dialog management
Correction Detection
Detect whether or not the user is trying to correct the system
Related: aware-site detection
Similar ML approaches using multiple sources of knowledge [Litman, Swerts, Krahmer, etc.]
S: Where are you flying from?
U: [CityName={Aspen/0.6; Austin/0.2}]
S: Did you say you wanted to fly out of Aspen?
U: [No/0.6] [CityName={Boston/0.8}]
Proposed: Belief Updating
Integrate confidence annotation and correction detection in a unified framework for continuously tracking beliefs
[CityName={Aspen/?; Austin/?; Boston/?}]
A “belief updating” problem:
initial belief + system action + user response → updated belief
Formally…
Given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
Construct an updated belief P_updated(C), as "accurate" as possible:
P_updated(C) ← f(P_initial(C), SA, R)
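A minimal sketch of this interface in Python (the names Belief, UserResponse, and update_belief are illustrative, not from the talk):

```python
from dataclasses import dataclass, field

# A belief over a concept: hypothesis value -> probability.
Belief = dict  # e.g. {"Aspen": 0.6, "Austin": 0.2}

@dataclass
class UserResponse:
    """The user's response R, as seen by the system."""
    hypotheses: Belief = field(default_factory=dict)  # decoded concept values
    features: dict = field(default_factory=dict)      # prosody, markers, etc.

def update_belief(p_initial: Belief, system_action: str,
                  response: UserResponse) -> Belief:
    """P_updated(C) <- f(P_initial(C), SA, R); f is learned from data."""
    raise NotImplementedError  # the talk learns f from a labeled corpus
```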
Examples
Examples - continued
Outline
Introduction
Data
A simplified version of the problem
Approach
User behaviors
Learning: preliminary results
More on evaluation
Where to from here?
Data
Collected in an experiment with RoomLine, a phone-based, mixed-initiative system for making conference room reservations
Equipped with explicit and implicit confirmations
Corpus statistics:
  46 participants
  449 sessions, 8278 turns
  13.5% misunderstandings [9.8% / 22.5%]
  25.6% WER [19.6% / 39.5%]
  11362 concept updates
System actions and concept updates
Explicit and implicit confirmations
Start time: Explicit Confirmation / grounding [EC]
Date: Implicit Confirmation / grounding [IC]
System actions and concept updates
Date: Implicit Confirmation / grounding [IC]
Start time: Implicit Confirmation / grounding [IC]
End time: Implicit Confirmation / task [ICT]
# of Conflicting Hypotheses
Fewer than 3% of concept updates involve more than 1 hypothesis
The system does not currently make use of multiple recognition hypotheses
[Future work: regenerate multiple hypotheses in batch]
A Simplified Version
Given that only 3% of updates involve more than 1 hypothesis, update the belief in the top hypothesis after implicit and explicit confirmations
Instead of P_updated(C) ← f(P_initial(C), SA, R)
do ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R), for SA ∈ {EC, IC, ICT}
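In code, the simplification means tracking a single scalar. The rules below are only a guess at the shape of a simple baseline update, not the system's actual f:

```python
def update_top_confidence(conf_initial: float, sa: str, answer_type: str) -> float:
    """ConfTop_updated(C) <- f(ConfTop_initial(C), SA, R), SA in {EC, IC, ICT}.
    The yes/no rules are an illustrative guess, not the talk's learned f."""
    assert sa in {"EC", "IC", "ICT"}
    if answer_type == "yes":     # user confirmed the hypothesis
        return 1.0
    if answer_type == "no":      # user rejected it
        return 0.0
    return conf_initial          # otherwise leave the belief unchanged
```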
Approach
Use machine learning
Dataset: concept updates for EC, IC, ICT
Features:
  initial confidence score ConfTop_initial(C)
  system action (SA)
  user response (R)
Target: updated confidence score ConfTop_updated(C)
The data is labeled, so we have a binary target
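For concreteness, one labeled training example per concept update might look like this (feature names and values are invented for illustration):

```python
example = {
    # features
    "conf_initial": 0.62,    # ConfTop_initial(C)
    "system_action": "IC",   # one of EC / IC / ICT
    "answer_type": "other",  # how the user response parsed
    # binary target from the labels: was the top hypothesis correct?
    "label": 0,
}
```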
User behaviors
Study of user behaviors in response to ICs and ECs
Can inform feature selection and feature development
Provides insight into where the difficulties are
Can inform potential strategy refinements
User responses to ECs
Transcripts:
            YES                  NO                   Other
CORRECT     1097 [94.2% of cor]  8                    62
INCORRECT   3                    202 [69.9% of inc]   84

Decoded:
            YES                  NO                   Other
CORRECT     1016 [87.3% of cor]  11                   137
INCORRECT   2                    171 [69.9% of inc]   116

~10% degradation from transcripts to decoded responses
“Other” Responses to EC
“Eyeball” estimates (out of 146 responses):
  ~70% simply repeat the correct concept value (that should come in as a handy feature)
  ~10% change conversation focus
  ~10% turn-overtaking issues (maybe inhibit barge-in until Antoine finishes his thesis)
  ~10% other
User responses to ICs
Transcripts:
            YES                 NO                  Other
CORRECT     166 [31.3% of cor]  38                  326
INCORRECT   15                  75 [31.5% of inc]   148

Decoded:
            YES                 NO                  Other
CORRECT     151 [28.5% of cor]  20                  369
INCORRECT   16                  62 [26.1% of inc]   160
Users Don’t Always Correct ICs
Actually, they corrected in 45% of the cases:

            User does not correct   User corrects
CORRECT     557                     1
INCORRECT   126 [55% of inc]        104 [45% of inc]
That means if we knew exactly when they correct, we’d still have (126+1)/788 = 16% error
So what do users do when they don't correct?
  They may actually correct partially
  Completely ignore the error … (if non-essential)
  Readjust to accommodate the task
More questions…
Better understand this "ignore" phenomenon
Impact on task success? IC correction rate: 49% (successful tasks) vs. 41% (unsuccessful)
Fixed vs. more "flexible" scenarios
Impact of prompt length on P(user will correct)?
"Essential" vs. "non-essential" concepts?
Which ML technique?
Need good probability outputs: margins produced by discriminant classifiers are inadequate
If you want calibrated probability scores (i.e., conf = 0.85 means that in 85% of cases with conf = 0.85 the concept is right), evaluate on a soft metric [I'll contradict myself later!!]
Step-wise logistic regression:
  sample-efficient
  feature selection
  good soft-metric performance: optimizes for avg. log-likelihood of the data
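A rough sketch of step-wise feature selection around a logistic model, scored by average log-likelihood (scikit-learn is my choice here; the talk's actual procedure uses P-entry/P-reject tests and a BIC stop, shown later):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def avg_log_likelihood(model, X, y):
    """The soft metric: average log-likelihood of the labels under the model."""
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def forward_stepwise(X, y, max_features=10):
    """Greedily add the feature that most improves avg. log-likelihood."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
            scores[j] = avg_log_likelihood(model, X[:, cols], y)
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```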
Data and Features
For each system action {EC, IC, ICT}:
  initial confidence score
  other indicators of the current state:
    how well the dialog has been going
    which concept we are talking about
    how far back this concept was acquired
Features on the user response:
  confirmation and disconfirmation markers
  acoustic / prosodic: f0 (min, max, range, maxslope, etc.) + normalized versions
  num. words; turn length (secs)
  concept information: expected / repeated / new concepts and grammar slots…
  confidence
  barge-in & timeout info
  lexical features (preselected by MI with the "target" or confirm/disconfirm markers)
Results
Actually using a 1-level logistic model tree:
  split on answer_type = {yes, no, other, no_parse}
  perform step-wise logistic regression on the 4 leaves
  P-entry = 0.05, P-reject = 0.30, BIC stopping criterion
Also tried a full-blown model tree; results are similar, maybe marginally worse
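A minimal sketch of that model tree: split the data on answer_type and fit one logistic regression per leaf (scikit-learn again; the step-wise selection inside each leaf is elided, and the sketch assumes every leaf has examples of both classes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OneLevelLMT:
    """1-level logistic model tree, split on answer_type."""
    LEAVES = ("yes", "no", "other", "no_parse")

    def fit(self, answer_types, X, y):
        """answer_types: array of leaf labels, one per concept update."""
        answer_types = np.asarray(answer_types)
        self.models = {}
        for leaf in self.LEAVES:
            mask = answer_types == leaf
            self.models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return self

    def predict_conf(self, answer_type, x):
        """Updated confidence ConfTop_updated(C) for a single concept update."""
        return self.models[answer_type].predict_proba(x.reshape(1, -1))[0, 1]
```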
Explicit Confirmation
                HARD (error rate)   SOFT (avg. log-likelihood)
Initial         31.1%               -0.5076
Heuristic        8.6%               -0.1943
LMT (CV)         3.7%               -0.1160
LMT (training)   2.9%               -0.0851
[Bar charts: error rate (%) and avg. log-likelihood for Initial, Heuristic, LMT (CV), LMT (training); same numbers as the table above]
Implicit Confirmation
[Bar charts: error rate (%) and avg. log-likelihood for Initial, Heuristic, LMT (CV), LMT (training); same numbers as the table below]
                 HARD (error rate)   SOFT (avg. log-likelihood)
Initial          31.4%               -0.6217
Heuristic        24.0%               -0.6736
LMT (CV)         19.6%               -0.4521
LMT (training)   18.8%               -0.4124
Oracle baseline  16.1%               -
What can Logistic Regression / AVG-LL do for you?
D = {d1, d2, d3, d4, …}, with di ∈ {0, 1}
P(D) = ∏i P(di | xi)
Express the density as P(d=1 | x) = 1 / (1 + exp(-w·x))
  (you can actually derive this form if you start with Gaussian P(x | d))
Find parameters w to maximize P(D):
  argmax P(D) = argmax ∏i P(di | xi) = argmin ∑i -log P(di | xi)
Hence we maximize the average log-likelihood of the data
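Written out directly as code, the objective being maximized is the average log-likelihood (equivalently, minimizing its negation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_avg_log_likelihood(w, X, y):
    """-(1/n) * sum_i log P(d_i | x_i), under P(d=1 | x) = sigmoid(w.x)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```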
But what does that mean?
Loss function in Logistic Regression
Log-likelihood loss function
If d = 1, then P(d=1) = 0.01 is ten times worse than P(d=1) = 0.1, but P(d=1) = 0.7 is about the same as P(d=1) = 0.8
Things are mirrored for d = 0
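The claim is easy to check numerically:

```python
import math

for p in (0.01, 0.1, 0.7, 0.8):
    print(f"P(d=1) = {p:4}: loss = {-math.log(p):.2f}")
# P(d=1) = 0.01: loss = 4.61
# P(d=1) =  0.1: loss = 2.30
# P(d=1) =  0.7: loss = 0.36
# P(d=1) =  0.8: loss = 0.22
```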
This does not match the “threshold” model commonly used to engage actions
A New Loss Function: T2
A loss function that better matches our domain: T2 (or even T3)
Optimize argmax ∑i T2(P(di = c | xi))
  not differentiable
  not convex
[Step-loss plots over P(d|x) with thresholds t1, t2: costs C1, C2 for d = 1; mirrored with costs C3, C4 for d = 0]
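A direct sketch of T2 as code; which cost applies in which band is my reading of the plots (lower confidence in the true label costs more), and the threshold and cost values are placeholders:

```python
def t2_loss(p, d, t1=0.35, t2=0.65, c=(1.0, 2.0, 1.0, 2.0)):
    """Step loss over p = P(d=1|x) with thresholds t1 < t2.
    For d = 1: cost C2 below t1, C1 in [t1, t2), 0 at/above t2;
    mirrored for d = 0 with costs C3, C4. All values are illustrative."""
    c1, c2, c3, c4 = c
    if d == 1:
        return c2 if p < t1 else (c1 if p < t2 else 0.0)
    return 0.0 if p < t1 else (c3 if p < t2 else c4)
```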
Smoothed version
A loss function that better matches our domain: T2 (or even T3)
Optimize argmax ∑i SmoothT2(P(di = c | xi))
  differentiable!
  but still not convex … multiple local maxima
SmoothT2(p) = σ1(p) + σ2(p), where σi(p) = 1 / (1 + exp(ki (p − θi)))
with the ki and θi chosen accordingly
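A sketch of the smoothed loss for d = 1; the steepness values k_i and centers θ_i below are illustrative stand-ins for "chosen accordingly":

```python
import numpy as np

def smooth_t2(p, k=(30.0, 30.0), theta=(0.35, 0.65)):
    """SmoothT2(p) = sigma1(p) + sigma2(p), with
    sigma_i(p) = 1 / (1 + exp(k_i * (p - theta_i))).
    With k_i > 0 each sigma falls from ~1 to ~0 around theta_i, so the
    sum approximates the 2 -> 1 -> 0 staircase of T2 for d = 1."""
    p = np.asarray(p, dtype=float)
    return sum(1.0 / (1.0 + np.exp(ki * (p - ti))) for ki, ti in zip(k, theta))
```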
Costs & Thresholds
Costs: where from?
  "expert" knowledge
  derive from data (might be tricky)
Thresholds: where from?
  fixed
  or actually optimized at the same time: SmoothT2 = SmoothT2(w, th1, th2) is differentiable in th1 and th2, so we can do gradient search for them
This calibrates in one step both the belief updating and the thresholds, to minimize loss
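A sketch of that one-step calibration: stack the model weights and the two thresholds into a single parameter vector and do gradient search on the total smoothed loss (finite differences here, purely for illustration):

```python
import numpy as np

def numeric_grad(f, params, eps=1e-5):
    """Finite-difference gradient of a scalar function f at params."""
    g = np.zeros_like(params)
    for i in range(len(params)):
        up, dn = params.copy(), params.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (f(up) - f(dn)) / (2 * eps)
    return g

def calibrate(loss_fn, params, lr=0.05, steps=1000):
    """Gradient descent over [w..., th1, th2]: the belief model and the
    action thresholds are tuned together to minimize the smoothed loss."""
    for _ in range(steps):
        params = params - lr * numeric_grad(loss_fn, params)
    return params
```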
Questions: What Next?
ICT: can we do anything there? Looks really tough
Push for better performance …
  add more features?
  debug the models more, eliminate singularities
  why doesn't the model tree do better?
Push for better understanding …
  what are the other interesting questions?
Optimize for the new loss function
More in the future: look at the full belief updating problem
Thank You!
Encoding System Actions
For each concept update, define a system action signature: <IC, ICT, EC, REQ>
  IC: Implicit Confirm [grounding]
  ICT: Implicit Confirm [task]
  EC: Explicit Confirm
  REQ: Request
Each variable can take 1 of 4 values:
  0 (the action does not occur)
  C (action happens on the concept of interest)
  OC (action happens on some other concept)
  C&OC (action happens both on the concept of interest and some other concept)
Only certain combinations are valid and appear in the data
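A small sketch of the signature space; the enumeration code is mine, and which combinations are valid is an empirical matter the slide leaves to the data:

```python
from itertools import product

ACTIONS = ("IC", "ICT", "EC", "REQ")
VALUES = ("0", "C", "OC", "C&OC")   # per-variable values from the slide

def all_signatures():
    """All 4^4 = 256 candidate signatures <IC, ICT, EC, REQ>; only some
    combinations are valid and actually occur in the corpus."""
    return [dict(zip(ACTIONS, combo)) for combo in product(VALUES, repeat=4)]
```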