Transcript
Page 1: Using Reinforcement Learning to Build a Better Model of Dialogue State

Using Reinforcement Learning to Build a Better Model of Dialogue State

Joel Tetreault & Diane Litman
University of Pittsburgh, LRDC
April 7, 2006

Page 2: Using Reinforcement Learning to Build a Better Model of Dialogue State

Problem

Problems with designing spoken dialogue systems:
What features to use?
How to handle noisy data or miscommunications?
Hand-tailoring policies for complex dialogues?

Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., '02; Walker, '00; Henderson et al., '05]. However, there is very little empirical work on testing the utility of adding specialized features to construct a better dialogue state.

Page 3: Using Reinforcement Learning to Build a Better Model of Dialogue State

Goal

Lots of features can be used to describe the user state; which ones do you use?

Goal: show that adding more complex features to the state is a worthwhile pursuit, since doing so alters what actions a system should take

5 features: certainty, student dialogue move, concept repetition, frustration, student performance

All are important to tutoring systems, but also are important to dialogue systems in general

Page 4: Using Reinforcement Learning to Build a Better Model of Dialogue State

Outline

Markov Decision Processes (MDP)
MDP Instantiation
Experimental Method
Results

Page 5: Using Reinforcement Learning to Build a Better Model of Dialogue State

Markov Decision Processes

What is the best action for an agent to take in any state to maximize the reward at the end?

MDP Input: States, Actions, Reward Function

Page 6: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP Output

Use policy iteration to propagate the final reward back to the states to determine:
V-value: the worth of each state
Policy: the optimal action to take in each state

Values and policies are based on the reward function but also on the probabilities of getting from one state to the next given a certain action
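
To make the policy-iteration step concrete, here is a minimal sketch in Python over a toy two-state dialogue MDP. The transition probabilities, rewards, and discount factor are made up for illustration; they are not the ITSPOKE numbers.

```python
# Minimal policy iteration over a toy two-state dialogue MDP.
# The states, transition probabilities, rewards, and discount factor are
# made up for illustration; they are not the ITSPOKE numbers.
GAMMA = 0.9
ACTIONS = ["Feed", "NonFeed", "Mix"]

# P[state][action] = list of (next_state, probability) pairs.
P = {
    "C": {"Feed":    [("FINAL", 0.6), ("C", 0.3), ("I", 0.1)],
          "NonFeed": [("FINAL", 0.2), ("C", 0.5), ("I", 0.3)],
          "Mix":     [("FINAL", 0.4), ("C", 0.4), ("I", 0.2)]},
    "I": {"Feed":    [("FINAL", 0.1), ("C", 0.3), ("I", 0.6)],
          "NonFeed": [("FINAL", 0.3), ("C", 0.4), ("I", 0.3)],
          "Mix":     [("FINAL", 0.2), ("C", 0.5), ("I", 0.3)]},
}
R = {"C": 0.0, "I": 0.0, "FINAL": 100.0}   # reward collected on entering a state

def q(state, action, V):
    """Expected value of taking `action` in `state` under value table V."""
    return sum(p * (R[nxt] + GAMMA * V[nxt]) for nxt, p in P[state][action])

V = {s: 0.0 for s in ["C", "I", "FINAL"]}  # FINAL is absorbing, worth 0 onward
policy = {s: "Feed" for s in P}

while True:
    # Policy evaluation: sweep until V settles under the current policy.
    while True:
        delta = 0.0
        for s in P:
            new_v = q(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < 1e-6:
            break
    # Policy improvement: greedily pick the best action in each state.
    stable = True
    for s in P:
        best = max(ACTIONS, key=lambda a: q(s, a, V))
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:
        break

print("V-values:", V)       # the worth of each state
print("Policy:  ", policy)  # the optimal action to take in each state
```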

Page 7: Using Reinforcement Learning to Build a Better Model of Dialogue State

What’s the best path to the fly?

Page 8: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP Frog Example

[Grid-world figure: the fly is the final state with reward +1; every other cell is labeled -1, the cost of each hop.]

Page 9: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP Frog Example

[Same grid with values propagated back from the final state (+1): each cell now shows the value of the best path to the fly from that position (0, -1, -2, or -3).]

Page 10: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP’s in Spoken Dialogue

[Diagram: training data is fed to the MDP, which produces a policy for the dialogue system; the dialogue system then interacts with a user simulator or a human user.]

The MDP works offline; the interactions work online.

Page 11: Using Reinforcement Learning to Build a Better Model of Dialogue State

ITSPOKE Corpus

100 dialogues with ITSPOKE spoken dialogue tutoring system [Litman et al. ’04]

All possible dialogue paths were authored by physics experts

Dialogues informally follow question-answer format

50 turns per dialogue on average
Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned

Page 12: Using Reinforcement Learning to Build a Better Model of Dialogue State
Page 13: Using Reinforcement Learning to Build a Better Model of Dialogue State

Corpus Annotations

Manual annotations:
Tutor and Student Moves (similar to Dialog Acts) [Forbes-Riley et al., '05]
Frustration and certainty [Litman et al. '04] [Liscombe et al. '05]

Automated annotations:
Correctness (based on the student's response to the last question)
Concept Repetition (whether a concept is repeated)
%Correctness (past performance)

Page 14: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP State Features

Features             Values
Correctness          Correct (C), Incorrect/Partially Correct (I)
Certainty            Certain (cer), Neutral (neu), Uncertain (unc)
Student Move         Shallow (S), Deep/Novel Answer/Assertion (O)
Concept Repetition   New Concept (0), Repeated (R)
Frustration          Frustrated (F), Neutral (N)
% Correctness        50-100% High (H), 0-49% Low (L)
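
As a small illustration of how per-turn annotations could be collapsed into a single state label like [Cer,C,S], here is a hypothetical encoding helper; the function name, argument names, and input encodings are assumptions, while the output value codes follow the table above.

```python
# Sketch: collapse a turn's annotations into a compact MDP state label such as
# "[Cer,C,S]".  The helper name, argument names, and input encodings are
# hypothetical; the output value codes follow the feature table above.
def encode_state(certainty, correct, smove=None, concept=None,
                 frustration=None, pct_correct=None):
    parts = [{"certain": "Cer", "neutral": "Neu", "uncertain": "Unc"}[certainty],
             "C" if correct else "I"]          # partially correct counts as I
    if smove is not None:
        parts.append("S" if smove == "shallow" else "O")
    if concept is not None:
        parts.append("R" if concept == "repeated" else "0")  # 0 = new concept
    if frustration is not None:
        parts.append("F" if frustration else "N")
    if pct_correct is not None:
        parts.append("H" if pct_correct >= 0.5 else "L")     # 50-100% = High
    return "[" + ",".join(parts) + "]"

print(encode_state("certain", correct=True, smove="shallow"))   # -> [Cer,C,S]
```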

Page 15: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP Action Choices

Case      TMove             Example Turn
Feed      Pos               "Super."
NonFeed   Hint, Ques.       "To analyze the pumpkin's acceleration we will use Newton's Second Law. What is the definition of the law?"
Mix       Pos, Rst, Ques.   "Good. So when the truck and car collide they exert a force on each other. What is the relationship between their magnitudes?"

Page 16: Using Reinforcement Learning to Build a Better Model of Dialogue State

MDP Reward Function

Reward Function: use normalized learning gain to do a median split on corpus:

10 students are “high learners” and the other 10 are “low learners”

High learner dialogues had a final state with a reward of +100; low learner dialogues had one of -100

NLG = (posttest - pretest) / (1 - pretest)
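
A minimal sketch of this reward assignment (variable and function names are illustrative; test scores are assumed to be fractions in [0, 1] with pretest < 1):

```python
# Sketch of the reward assignment: normalized learning gain plus a median
# split over the 20 students.  Variable names are illustrative; test scores
# are assumed to be fractions in [0, 1] with pretest < 1.
def normalized_learning_gain(pretest, posttest):
    return (posttest - pretest) / (1.0 - pretest)

def assign_final_rewards(student_scores):
    """student_scores: dict of student id -> (pretest, posttest)."""
    nlg = {sid: normalized_learning_gain(pre, post)
           for sid, (pre, post) in student_scores.items()}
    median = sorted(nlg.values())[len(nlg) // 2]
    # Students at or above the median are "high learners" (+100 on the final
    # state of their dialogues); the rest are "low learners" (-100).
    return {sid: (100 if gain >= median else -100) for sid, gain in nlg.items()}
```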

Page 17: Using Reinforcement Learning to Build a Better Model of Dialogue State

Infrastructure

1. State Transformer:
   Based on RLDS [Singh et al., '99]
   Outputs the state-action probability matrix and reward matrix

2. MDP Matlab Toolkit (from INRA) to generate policies
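
The RLDS-based state transformer itself is not shown in the slides, so the following is an assumed reconstruction of that step: count state-to-state transitions per tutor action across the annotated dialogues, normalize them into probabilities, and attach the +/-100 reward to absorbing final states.

```python
# Assumed reconstruction of the state transformer step: estimate the
# state-action transition probabilities and the reward matrix from annotated
# dialogues.  Input format is an assumption: each dialogue is a list of
# (state, tutor_action) pairs plus a final reward of +100 or -100.
from collections import Counter, defaultdict

def build_matrices(dialogues):
    counts = defaultdict(Counter)            # (state, action) -> next-state counts
    for turns, final_reward in dialogues:
        final = "FINAL_HIGH" if final_reward > 0 else "FINAL_LOW"
        path = turns + [(final, None)]       # append the absorbing final state
        for (s, a), (s_next, _) in zip(path, path[1:]):
            counts[(s, a)][s_next] += 1
    # Normalize counts into transition probabilities P(s' | s, a).
    transitions = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
                   for sa, c in counts.items()}
    # Reward matrix: +/-100 on entering the absorbing final states, 0 elsewhere.
    rewards = {"FINAL_HIGH": 100, "FINAL_LOW": -100}
    return transitions, rewards

# Hypothetical usage with two tiny dialogues:
P, R = build_matrices([
    ([("[C]", "Feed"), ("[I]", "Mix")], +100),
    ([("[I]", "NonFeed"), ("[I]", "Mix")], -100),
])
```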

Page 18: Using Reinforcement Learning to Build a Better Model of Dialogue State

Methodology

Construct MDPs to test the inclusion of new state features against a baseline:
Develop a baseline state and policy
Add a feature to the baseline and compare policies
A feature is deemed important if adding it results in a change in policy from the baseline policy ("shifts")
For each MDP: verify that policies are reliable (V-value convergence)
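
A small sketch of the shift count described above, with hypothetical states and policies; the mapping from an expanded state back to its baseline state is an assumption for illustration.

```python
# Sketch: count policy "shifts" after adding a feature.  Each expanded state
# (e.g. "[C,Cer]") is compared against the policy of the baseline state it
# refines ("[C]").  States and policies here are hypothetical.
def count_shifts(baseline_policy, expanded_policy, to_baseline_state):
    return sum(1 for state, action in expanded_policy.items()
               if action != baseline_policy[to_baseline_state(state)])

baseline = {"[C]": "Feed", "[I]": "Feed"}
expanded = {"[C,Cer]": "Feed", "[C,Neu]": "Feed", "[C,Unc]": "Mix",
            "[I,Cer]": "Mix",  "[I,Neu]": "Feed", "[I,Unc]": "Mix"}
# The baseline state is just the correctness part of the expanded label.
print(count_shifts(baseline, expanded, lambda s: "[" + s[1] + "]"))  # -> 3
```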

Page 19: Using Reinforcement Learning to Build a Better Model of Dialogue State

Hypothetical Policy Change Example

B1 State   B1 Policy   B1+Certainty State   +Cert 1 Policy   +Cert 2 Policy
1 [C]      Feed        [C,Cer]              Feed             Mix
                       [C,Neu]              Feed             Feed
                       [C,Unc]              Feed             Mix
2 [I]      Feed        [I,Cer]              Mix              Mix
                       [I,Neu]              Mix              NonFeed
                       [I,Unc]              Mix              Mix

+Cert 1: 0 shifts    +Cert 2: 5 shifts

Page 20: Using Reinforcement Learning to Build a Better Model of Dialogue State

Tests

[Roadmap diagram: Baseline 1 (Correctness); B1 + Certainty gives Baseline 2; B2 is then extended individually with +SMove, +Goal, +Frustration, and +%Correct.]

Page 21: Using Reinforcement Learning to Build a Better Model of Dialogue State

Baseline

Actions: {Feed, NonFeed, Mix}
Baseline State: {Correctness}

Baseline network

[Network diagram: states [C] and [I], each connected to the other and to FINAL by the action choices F|NF|Mix.]

Page 22: Using Reinforcement Learning to Build a Better Model of Dialogue State

Baseline 1 Policies

Trend: if student correctness is the only feature in the model of student state, then regardless of the student's response the best tactic is to always give simple feedback

#   State   State Size   Policy
1   [C]     1308         Feed
2   [I]     872          Feed

Page 23: Using Reinforcement Learning to Build a Better Model of Dialogue State

But are our policies reliable?

The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work

Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus

Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data and rerun the MDP on each subset)
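
A sketch of that convergence check, assuming a solve_mdp routine (e.g., the policy-iteration sketch earlier combined with the matrix-building step) and per-student dialogue data; all names are illustrative.

```python
# Sketch of the convergence check: rerun the MDP on growing subsets of the
# corpus and track the V-value of a state of interest.  `solve_mdp` stands in
# for the state transformer + MDP toolkit pipeline (names are illustrative).
def convergence_curve(students, solve_mdp, state_of_interest):
    curve, corpus = [], []
    for student_dialogues in students:      # add one student (5 dialogues) ...
        corpus.extend(student_dialogues)    # ... to the growing subset
        V, _policy = solve_mdp(corpus)      # rerun the MDP on the subset
        curve.append(V.get(state_of_interest, 0.0))
    return curve   # plot this; a flattening curve suggests the data suffice
```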

Page 24: Using Reinforcement Learning to Build a Better Model of Dialogue State

Baseline Convergence Plot

Page 25: Using Reinforcement Learning to Build a Better Model of Dialogue State

Methodology: Adding More Features

Create a more complicated baseline by adding the certainty feature (new baseline = B2)
Add the other 4 features (student moves, concept repetition, frustration, performance) individually to the new baseline
Check that V-values converge
Analyze policy changes

Page 26: Using Reinforcement Learning to Build a Better Model of Dialogue State

Tests

[Roadmap diagram: Baseline 1 (Correctness); B1 + Certainty gives Baseline 2; B2 is then extended individually with +SMove, +Goal, +Frustration, and +%Correct.]

Page 27: Using Reinforcement Learning to Build a Better Model of Dialogue State

Certainty

Previous work (Bhatt et al., ’04) has shown the importance of certainty in ITS

A student who is certain and correct may not need feedback, but a student who is correct yet showing some doubt may be becoming confused, so give more feedback

Page 28: Using Reinforcement Learning to Build a Better Model of Dialogue State

B2: Baseline + Certainty Policies

B1 State   B1 Policy   B1+Certainty State   +Certainty Policy
1 [C]      Feed        [C,Cer]              NonFeed
                       [C,Neu]              Feed
                       [C,Unc]              NonFeed
2 [I]      Feed        [I,Cer]              NonFeed
                       [I,Neu]              Mix
                       [I,Unc]              NonFeed

Trend: if neutral, give Feed or Mix; otherwise give NonFeed

Page 29: Using Reinforcement Learning to Build a Better Model of Dialogue State

Baseline 1 and 2 Convergence Plots

Page 30: Using Reinforcement Learning to Build a Better Model of Dialogue State

Tests

[Roadmap diagram: Baseline 1 (Correctness); B1 + Certainty gives Baseline 2; B2 is then extended individually with +SMove, +Goal, +Frustration, and +%Correct.]

Page 31: Using Reinforcement Learning to Build a Better Model of Dialogue State

% Correct Convergence Plots

Page 32: Using Reinforcement Learning to Build a Better Model of Dialogue State

Student Move Policies

B2 State    B2 Policy   B2+SMove State   +SMove Policy
1 [Cer,C]   NonFeed     [Cer,C,S]        NonFeed
                        [Cer,C,O]        Feed
2 [Cer,I]   NonFeed     [Cer,I,S]        Mix
                        [Cer,I,O]        Mix
3 [Neu,C]   Feed        [Neu,C,S]        Feed
                        [Neu,C,O]        NonFeed
4 [Neu,I]   Mix         [Neu,I,S]        Mix
                        [Neu,I,O]        NonFeed
5 [Unc,C]   NonFeed     [Unc,C,S]        Mix
                        [Unc,C,O]        NonFeed
6 [Unc,I]   NonFeed     [Unc,I,S]        Mix
                        [Unc,I,O]        NonFeed

Trend: give Mix if Shallow (S), give NonFeed if Other (O)

7 Changes

Page 33: Using Reinforcement Learning to Build a Better Model of Dialogue State

Concept Repetition Policies

Trend: if the concept is repeated (R), give complex or mix feedback

B2 State    B2 Policy   B2+Concept State   +Concept Policy
1 [Cer,C]   NonFeed     [Cer,C,O]          NonFeed
                        [Cer,C,R]          Feed
2 [Cer,I]   NonFeed     [Cer,I,O]          Mix
                        [Cer,I,R]          Mix
3 [Neu,C]   Feed        [Neu,C,O]          Mix
                        [Neu,C,R]          Feed
4 [Neu,I]   Mix         [Neu,I,O]          Mix
                        [Neu,I,R]          Mix
5 [Unc,C]   NonFeed     [Unc,C,O]          NonFeed
                        [Unc,C,R]          NonFeed
6 [Unc,I]   NonFeed     [Unc,I,O]          NonFeed
                        [Unc,I,R]          NonFeed

4 Shifts

Page 34: Using Reinforcement Learning to Build a Better Model of Dialogue State

Frustration Policies

Trend: if the student is frustrated (F), give NonFeed

B2 State    B2 Policy   B2+Frustration State   +Frustration Policy
1 [Cer,C]   NonFeed     [Cer,C,N]              NonFeed
                        [Cer,C,F]              Feed
2 [Cer,I]   NonFeed     [Cer,I,N]              NonFeed
                        [Cer,I,F]              NonFeed
3 [Neu,C]   Feed        [Neu,C,N]              Feed
                        [Neu,C,F]              NonFeed
4 [Neu,I]   Mix         [Neu,I,N]              Mix
                        [Neu,I,F]              NonFeed
5 [Unc,C]   NonFeed     [Unc,C,N]              NonFeed
                        [Unc,C,F]              NonFeed
6 [Unc,I]   NonFeed     [Unc,I,N]              NonFeed
                        [Unc,I,F]              NonFeed

4 Shifts

Page 35: Using Reinforcement Learning to Build a Better Model of Dialogue State

Percent Correct Policies

Trend: if student is a low performer (L), give NonFeed

B2 State    B2 Policy   B2+%Correct State   +%Correct Policy
1 [Cer,C]   NonFeed     [Cer,C,H]           NonFeed
                        [Cer,C,L]           NonFeed
2 [Cer,I]   NonFeed     [Cer,I,H]           Mix
                        [Cer,I,L]           NonFeed
3 [Neu,C]   Feed        [Neu,C,H]           Feed
                        [Neu,C,L]           Feed
4 [Neu,I]   Mix         [Neu,I,H]           NonFeed
                        [Neu,I,L]           Mix
5 [Unc,C]   NonFeed     [Unc,C,H]           Mix
                        [Unc,C,L]           NonFeed
6 [Unc,I]   NonFeed     [Unc,I,H]           NonFeed
                        [Unc,I,L]           NonFeed

3 Shifts

Page 36: Using Reinforcement Learning to Build a Better Model of Dialogue State

Discussion

Incorporating more information into a representation of the student state has an impact on tutor policies

Despite not having human or simulated users, we can still claim that our findings are reliable due to the convergence of V-values and policies

Including Certainty, Student Moves and Concept Repetition effected the most change

Page 37: Using Reinforcement Learning to Build a Better Model of Dialogue State

Future Work

Developing user simulations and annotating more human-computer experiments to further verify our policies are correct

More data allows us to develop more complicated policies, such as:
More complex tutor actions (hints, questions)
Combinations of state features
More refined reward functions (PARADISE)

Developing more complex convergence tests

Page 38: Using Reinforcement Learning to Build a Better Model of Dialogue State

Related Work

[Paek and Chickering, '05]
[Singh et al., '99] – optimal dialogue length
[Frampton et al., '05] – last dialogue act
[Williams et al., '03] – automatically generate good state/action sets

Page 39: Using Reinforcement Learning to Build a Better Model of Dialogue State

Diff Plots

Diff Plot: compare final policy (20 students) with policies generated at smaller cuts

