Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun Spoken Dialogue System

Diane Litman
AT&T Labs - Research, Florham Park, NJ 07932
http://www.research.att.com/~diane


Page 1: Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun Spoken Dialogue System Diane Litman AT&T Labs - Research Florham

Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun Spoken Dialogue System

Diane Litman
AT&T Labs - Research, Florham Park, NJ 07932
http://www.research.att.com/~diane

Page 2:

Research Motivations

• Builders of real-time spoken dialogue systems face fundamental design choices that strongly influence system performance
  – automatic optimization of default dialogue behavior via reinforcement learning? [COLING-00, AAAI-00] (and this talk)
  – personalization via supervised learning and voice control [ACL-99, UM-99, NAACL-00, AAAI-00, NAACL-01]
• New types of voice-enabled systems pose many interesting technical challenges (e.g., chat interfaces)

Page 3:

Outline

• Spoken dialogue systems
• Reinforcement learning for dialogue
• The NJFun application
• Empirical evaluation
• Discussion

Page 4:

Spoken Dialogue Systems

• Provide voice access to a back end via telephone or microphone
• Front end: ASR (automatic speech recognition) and TTS (text-to-speech)
• Back end: DB, web, etc.
• Middle: dialogue policy (what action to take at each point in a dialogue)

[Architecture diagram: user ↔ ASR / TTS ↔ dialogue manager ↔ DB]

Page 5:

RLDS Dialogues

Task Description: "You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning."

Unoptimized System:
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time.
... (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized System:
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh... morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!

Page 6:

Automatic Speech Recognition (ASR)

• Inputs: audio file; grammar/language model; acoustic model
• Outputs: utterance matched from grammar, or no match; confidence score
• Performance tradeoff:
  – "small" grammar --> high accuracy on constrained utterances, lots of no-matches
  – "large" grammar --> matches more utterances, but with lower confidence

Page 7:

Some Issues in Dialogue Policy Design

• Initiative policy
• Confirmation policy
• Criteria to be optimized

Page 8:

Initiative Policy

• System initiative vs. user initiative:
  – "Please state your departure city."
  – "How can I help you?"
• Influences user expectations
• ASR grammar must be chosen accordingly
• Best choice may differ from state to state
• May depend on user population & task

Page 9:

Confirmation Policy

• High ASR confidence: accept the ASR match and move on
• Moderate ASR confidence: confirm
• Low ASR confidence: re-ask
• How to set confidence thresholds?
• Early mistakes can be costly later, but excessive confirmation is annoying
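The three-way threshold rule above is easy to sketch in code. The numeric thresholds and action names below are illustrative assumptions, not values from NJFun (whose confidence bins were tuned from data):

```python
def confirmation_action(confidence, low=0.3, high=0.8):
    """Map an ASR confidence score in [0, 1] to a dialogue action.

    The thresholds `low` and `high` are hypothetical; in a real system
    they would be tuned empirically, which is exactly the kind of
    design choice the RL approach tries to make automatically.
    """
    if confidence >= high:
        return "accept"   # high confidence: take the ASR match and move on
    if confidence >= low:
        return "confirm"  # moderate confidence: ask "Did you say ...?"
    return "re-ask"       # low confidence: repeat the question

print(confirmation_action(0.9))  # accept
print(confirmation_action(0.5))  # confirm
print(confirmation_action(0.1))  # re-ask
```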

Page 10:

Criteria to be Optimized

• Task completion
• Sales revenues
• User satisfaction
• ASR performance
• Number of turns

Page 11:

Typical System Design: Sequential Search

• Choose and implement several "reasonable" dialogue policies
• Field systems, gather dialogue data
• Do statistical analyses
• Refield system with "best" dialogue policy
• Can only examine a handful of policies

Page 12:

Why Reinforcement Learning?

• Agents can learn to improve performance by interacting with their environment
• Thousands of possible dialogue policies; want to automate the choice of the "optimal" one
• Can handle many features of spoken dialogue:
  – noisy sensors (ASR output)
  – stochastic behavior (user population)
  – delayed rewards, and many possible rewards
  – multiple plausible actions
• However, many practical challenges remain

Page 13:

Our Approach

• Build an initial system that is deliberately exploratory wrt state and action space
• Use dialogue data from the initial system to build a Markov decision process (MDP)
• Use methods of reinforcement learning to compute the optimal policy (here, dialogue policy) of the MDP
• Refield the (improved?) system given by the optimal policy
• Empirically evaluate

Page 14:

State-Based Design

• System state: contains information relevant for deciding the next action
  – info attributes perceived so far
  – individual and average ASR confidences
  – data on particular user
  – etc.
• In practice, need a compressed state
• Dialogue policy: mapping from each state in the state space to a system action

Page 15:

Markov Decision Processes

• System state s (in S)
• System action a (in A)
• Transition probabilities P(s'|s,a)
• Reward function R(s,a) (stochastic)
• Our application: P(s'|s,a) models the population of users

Page 16:

SDSs as MDPs

[Trajectory diagram: a dialogue alternates system actions a1, a2, a3, ... and user/environment responses e1, e2, e3, ..., producing a state sequence s1 u1 s2 u2 s3 u3 ...]

• Initial system utterance; initial user utterance
• Actions have probabilistic outcomes
• Estimate transition probabilities P(next state | current state & action) and rewards R(current state, action) from a set of exploratory dialogues (random action choice) + system logs
• Violations of the Markov property! Will this work?
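The estimation step on this slide is maximum-likelihood counting over logged trajectories. A minimal sketch, assuming each logged dialogue is available as a list of (state, action, reward, next_state) tuples; the encodings are hypothetical, not NJFun's actual log format:

```python
from collections import defaultdict

def estimate_mdp(trajectories):
    """Estimate P(s'|s,a) by relative frequency and R(s,a) by averaging,
    over a set of exploratory dialogues."""
    next_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
    reward_sum = defaultdict(float)                      # (s,a) -> summed reward
    visits = defaultdict(int)                            # (s,a) -> visit count
    for traj in trajectories:
        for s, a, r, s_next in traj:
            next_counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
    P = {sa: {s2: n / visits[sa] for s2, n in nxt.items()}
         for sa, nxt in next_counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R

# Two toy exploratory dialogues (random action choices)
trajs = [[("s0", "GreetU", 0, "s1"), ("s1", "Conf", 1, "done")],
         [("s0", "GreetU", 0, "s2")]]
P, R = estimate_mdp(trajs)
print(P[("s0", "GreetU")])  # {'s1': 0.5, 's2': 0.5}
```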

Page 17:

Computing the Optimal Policy

• Given parameters P(s'|s,a), R(s,a), can efficiently compute the policy maximizing expected return
• Typically compute the expected cumulative reward (or Q-value) Q(s,a), using value iteration
• Optimal policy selects the action with the maximum Q-value at each dialogue state
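The value-iteration computation described above can be sketched with the MDP given as dictionaries. This is a generic textbook version, not NJFun's actual implementation; gamma = 1 is plausible here because dialogues are finite episodes:

```python
from collections import defaultdict

def q_value_iteration(P, R, gamma=1.0, n_iter=50):
    """Compute Q(s,a) by value iteration.

    P: dict mapping (s, a) -> {s': probability}
    R: dict mapping (s, a) -> expected immediate reward
    States with no available actions are treated as terminal (value 0).
    """
    actions = defaultdict(list)  # s -> available actions
    for (s, a) in P:
        actions[s].append(a)
    Q = {sa: 0.0 for sa in P}
    for _ in range(n_iter):
        # Bellman backup: Q(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')]
        Q = {(s, a): R[(s, a)] + gamma * sum(
                 p * max((Q[(s2, a2)] for a2 in actions[s2]), default=0.0)
                 for s2, p in P[(s, a)].items())
             for (s, a) in P}
    return Q

def optimal_policy(Q):
    """The optimal policy takes the argmax-Q action in each state."""
    best = {}
    for (s, a), q in Q.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    return {s: a for s, (a, q) in best.items()}

# Toy choice state: GreetS leads to reward 1, GreetU to reward -1
P = {("s0", "GreetS"): {"end": 1.0}, ("s0", "GreetU"): {"end": 1.0}}
R = {("s0", "GreetS"): 1.0, ("s0", "GreetU"): -1.0}
print(optimal_policy(q_value_iteration(P, R)))  # {'s0': 'GreetS'}
```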

Page 18:

Potential Benefits

• A principled and general framework for automated dialogue policy synthesis
  – learn the optimal action to take in each state
• Compares all policies simultaneously
  – data efficient because actions are evaluated as a function of state
  – traditional methods evaluate entire policies
• Potential for "lifelong learning" systems, adapting to changing user populations

Page 19:

The Application: NJFun

• Dialogue system providing telephone access to a DB of activities in NJ
• Want to obtain 3 attributes:
  – activity type (e.g., wine tasting)
  – location (e.g., Lambertville)
  – time (e.g., morning)
• Failure to bind an attribute: query DB with don't-care

Page 20:

The State Space

Feature                   | Values    | Explanation
Attribute (A)             | 1,2,3     | Which attribute is being worked on
Confidence/Confirmed (C)  | 0,1,2,3,4 | 0,1,2 for low, medium, and high ASR confidence; 3,4 for explicitly confirmed, disconfirmed
Value (V)                 | 0,1       | Whether a value has been obtained for the current attribute
Tries (T)                 | 0,1,2     | How many times the current attribute has been asked
Grammar (G)               | 0,1       | Whether an open or closed grammar was used
History (H)               | 0,1       | Whether there was trouble on any previous attribute

N.B. Non-state variables record attribute values; the state does not condition on previous attributes!
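The six features in the table pack into a small state vector. A sketch of that representation (the field names are mine; the values follow the table, and the bracket notation matches the deck's [1,0,0,0,0,0] initial state):

```python
from collections import namedtuple

# State vector [A, C, V, T, G, H], field order as in the table above
DialogueState = namedtuple("DialogueState",
                           "attribute confidence value tries grammar history")

# Initial state [1,0,0,0,0,0]: working on attribute 1, low ASR confidence,
# no value obtained, zero tries so far, grammar bit and history bit both 0
initial = DialogueState(1, 0, 0, 0, 0, 0)

def pretty(state):
    """Render a state in the deck's bracket notation."""
    return "[" + ",".join(str(v) for v in state) + "]"

print(pretty(initial))  # [1,0,0,0,0,0]
```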

Page 21:

Sample Action Choices

• Initiative (when T = 0)
  – user (open prompt and grammar)
  – mixed (constrained prompt, open grammar)
  – system (constrained prompt and grammar)
• Example
  – GreetU: "How may I help you?"
  – GreetS: "Please say an activity name."

Page 22:

Sample Confirmation Choices

• Confirmation (when V = 1)
  – confirm
  – no confirm
• Example
  – Conf3: "Did you say you want to go in the <time>?"
  – NoConf3: "" (no prompt)

Page 23:

Dialogue Policy Class

• Specify "reasonable" actions for each state
  – 42 choice states (binary initiative or confirmation action choices)
  – no choice for all other states
• Small state space (62), large policy space (2^42)
• Example choice state
  – initial state: [1,0,0,0,0,0]
  – action choices: GreetS, GreetU
• Learn the optimal action for each choice state

Page 24:

Some System Details

• Uses AT&T's WATSON ASR and TTS platform, DMD dialogue manager
• Natural language web version used to build multiple ASR language models
• Initial statistics used to tune bins for confidence values, history bit (informative state encoding)

Page 25:

The Experiment

• Designed 6 specific tasks, each with a web survey
• Split 75 internal subjects into training and test, controlling for M/F, native/non-native, experienced/inexperienced
• 54 training subjects generated 311 dialogues
• Training dialogues used to build the MDP
• Optimal policy for BINARY TASK COMPLETION computed and implemented
• 21 test subjects (for the modified system) generated 124 dialogues
• Did statistical analyses of performance changes

Page 26:

Reward Function

• Binary task completion (objective measure):
  – 1 for 3 correct bindings, else -1
• Task completion (allows partial credit):
  – -1 for an incorrect attribute binding
  – 0, 1, 2, or 3 for that many correct attribute bindings
• Other evaluation measures: ASR performance (objective), and phone feedback, perceived completion, future use, perceived understanding, user understanding, ease of use (all subjective)
• Optimized for binary task completion, but predicted improvements in other measures
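Both reward measures can be written directly from the bullet points above. This sketch assumes attribute bindings arrive as a dict, with None for an unbound (don't-care) attribute; treating any wrong binding as an immediate -1 is my reading of the "partial credit" bullet:

```python
def binary_task_completion(bindings, correct):
    """+1 if all three attributes are bound correctly, else -1."""
    return 1 if bindings == correct else -1

def task_completion(bindings, correct):
    """Partial credit: -1 as soon as any bound attribute is wrong,
    otherwise the number of correctly bound attributes (0..3)."""
    n_correct = 0
    for attr, value in bindings.items():
        if value is None:
            continue          # unbound: DB queried with don't-care, no credit
        if value != correct[attr]:
            return -1         # an incorrect attribute binding
        n_correct += 1
    return n_correct

correct = {"activity": "historic site", "location": "Stanhope", "time": "morning"}
print(binary_task_completion(correct, correct))           # 1
print(task_completion({"activity": "historic site",
                       "location": None,
                       "time": "morning"}, correct))      # 2
```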

Page 27:

Main Results

• Task completion (-1 to 3):
  – train mean = 1.72
  – test mean = 2.18
  – p-value < 0.03
• Binary task completion:
  – train mean = 51.5%
  – test mean = 63.5%
  – p-value < 0.06

Page 28:

Other Results

• ASR performance (0-3):
  – train mean = 2.48
  – test mean = 2.67
  – p-value < 0.04
• Binary task completion for experts (dialogues 3-6):
  – train mean = 45.6%
  – test mean = 68.2%
  – p-value < 0.01

Page 29:

Subjective Measures

• Subjective measures "move to the middle" rather than improve
• First graph: "It was easy to find the place that I wanted" (strongly agree = 5, ..., strongly disagree = 1)
  – train mean = 3.38, test mean = 3.39, p-value = .98

Page 30:

Comparison to Human Design

• Fielded comparison infeasible, but exploratory dialogues provide a Monte Carlo proxy of "consistent trajectories"
• Test policy: average binary completion reward = 0.67 (based on 12 trajectories)
• Outperforms several standard fixed policies:
  – SysNoConfirm: -0.08 (11)
  – SysConfirm: -0.6 (5)
  – UserNoConfirm: -0.2 (15)
  – Mixed: -0.077 (13)
  – UserConfirm: 0.2727 (11), no significant difference

Page 31:

A Sanity Check of the MDP

• Generate many random policies
• Compare value according to the MDP and value based on consistent exploratory trajectories
• MDP evaluation of a policy would ideally be perfectly accurate (infinite Monte Carlo sampling): linear fit with slope 1, intercept 0
• Correlation between Monte Carlo and MDP:
  – 1000 policies, > 0 trajectories: cor. 0.31, slope 0.953, int. 0.067, p < 0.001
  – 868 policies, > 5 trajectories: cor. 0.39, slope 1.058, int. 0.087, p < 0.001

Page 32:

Future Work

• Automate choice of states and actions
• Scale to more complex systems
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward function
• Online rather than batch learning

Page 33:

Related Work

• Biermann and Long (1996)
• Levin, Pieraccini, and Eckert (1997)
• Walker, Fromer, and Narayanan (1998)
• Singh, Kearns, Litman, and Walker (1999)
• Scheffler and Young (2000)
• Beck, Woolf, and Beal (2000)
• Roy, Pineau, and Thrun (2000)

Page 34:

Conclusions

• MDPs and RL are a promising framework for automated dialogue policy design
• Practical methodology for system-building
  – given a relatively small number of exploratory dialogues, learn the optimal policy within a large policy search space
• Our application: first empirical test of the formalism
• Resulted in measurable and significant system improvements, as well as interesting linguistic results