    Non-Strict Hierarchical Reinforcement Learning for Interactive Systems and Robots

    HERIBERTO CUAYÁHUITL, Heriot-Watt University, United Kingdom
    IVANA KRUIJFF-KORBAYOVÁ, German Research Centre for Artificial Intelligence, Germany
    NINA DETHLEFS, Heriot-Watt University, United Kingdom

    Conversational systems and robots that use reinforcement learning for policy optimization in large domains often face the problem of limited scalability. This problem has been addressed either by using function approximation techniques that estimate the approximate true value function of a policy or by using a hierarchical decomposition of a learning task into subtasks. We present a novel approach for dialogue policy optimization that combines the benefits of both hierarchical control and function approximation and that allows flexible transitions between dialogue subtasks to give human users more control over the dialogue. To this end, each reinforcement learning agent in the hierarchy is extended with a subtask transition function and a dynamic state space to allow flexible switching between subdialogues. In addition, the subtask policies are represented with linear function approximation in order to generalize the decision making to situations unseen in training. Our proposed approach is evaluated in an interactive conversational robot that learns to play quiz games. Experimental results, using simulation and real users, provide evidence that our proposed approach can lead to more flexible (natural) interactions than strict hierarchical control and that it is preferred by human users.

    Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning - Reinforcement and Supervised Learning; I.2.7 [Artificial Intelligence]: Natural Language Processing - Conversational Interfaces

    General Terms: Algorithms, Design, Experimentation, Performance

    Additional Key Words and Phrases: interactive robots, spoken dialogue systems, human-robot interaction, machine learning, reinforcement learning, hierarchical control, function approximation, user simulation, flexible interaction

    ACM Reference Format:
    Heriberto Cuayáhuitl, Ivana Kruijff-Korbayová, Nina Dethlefs, 2014. Non-Strict Hierarchical Reinforcement Learning for Flexible Interactive Systems and Robots. ACM Trans. Interact. Intell. Syst. 4, 3, Article A (September 2014), 25 pages.
    DOI: http://dx.doi.org/10.1145/0000000.0000000

    This work was carried out while the first author was at the German Research Center for Artificial Intelligence (DFKI GmbH) and was supported by the European Union Integrated Project ALIZ-E (FP7-ICT-248116).
    Authors' addresses: Heriberto Cuayáhuitl, Heriot-Watt University, School of Mathematical and Computer Sciences, Edinburgh, UK; Ivana Kruijff-Korbayová, German Research Center for Artificial Intelligence (DFKI GmbH), Saarbrücken, Germany; and Nina Dethlefs, Heriot-Watt University, School of Mathematical and Computer Sciences, Edinburgh, UK.
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
    © 2014 ACM 2160-6455/2014/09-ARTA $15.00
    DOI: http://dx.doi.org/10.1145/0000000.0000000

    ACM Transactions on Interactive Intelligent Systems, Vol. 4, No. 3, Article A, Pub. date: September 2014.

    1. INTRODUCTION AND MOTIVATION

    There is a shared belief in the artificial intelligence community that machine learning techniques will play an important role in the development of intelligent interactive systems and robots. This is attributed to the expectation that interactive systems and robots will incorporate increasing amounts of learning skills rather than hand-coded behaviors. In this article, we focus on the application of Reinforcement Learning (RL) in order to learn behaviors from interactions in an efficient, effective and natural way. The RL framework has been an attractive and promising alternative to hand-coded policies for the design of trainable and adaptive dialogue agents. An RL agent learns its behavior from interaction with an environment and the people within it, where situations are mapped to actions by maximizing a long-term reward signal [Sutton and Barto 1998; Szepesvári 2010]. Since spoken dialogue management was first framed as an optimization problem [Levin et al. 2000; Walker 2000; Young 2000; Singh et al. 2002], this field has experienced important progress along three main strands of work: scalability, robustness, and applicability. The first strand (scalability) addresses the fact that the state space growth is exponential in the number of state variables (also referred to as the curse of dimensionality problem). Attempts to solve this problem have involved replacing tabular representations with function approximators [Henderson et al. 2008; Li et al. 2009; Jurčíček et al. 2011], and dividing an agent into a hierarchy of agents [Cuayáhuitl et al. 2007; Cuayáhuitl et al. 2010]. The second strand (robustness) addresses the problem that the dialogue agent typically operates under uncertainty: the most obvious sources are speech and visual recognition errors, but they are not the only ones. Attempts to solve this problem involve finding a mapping from belief states (a probability distribution over states) to actions and have focused on keeping belief monitoring tractable by compressing the belief state [Roy et al. 2000; Williams 2007; Thomson 2009; Young et al. 2010; Crook et al. 2012]. The third strand (applicability) addresses interactive agents learning from real interactions. Advances have been limited here since learning algorithms usually require many dialogues to find optimal dialogue policies. Nevertheless, first attempts to solve this problem involve learning from real dialogue (for small policies) [Singh et al. 2002], learning from simulations (most of the literature), batch learning (offline) rather than online learning from interactions [Pietquin et al. 2011], fast policy learning [Gašić and Young 2011] with policy reuse [Cuayáhuitl and Dethlefs 2011], and learning from human demonstrations [Thomaz and Breazeal 2006].
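
    The scalability problem described above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the article: the function name, the choice of binary state variables, and the split into four disjoint subtasks are all assumptions made purely for illustration.

```python
from math import prod


def state_space_size(cardinalities):
    """Number of distinct states for discrete variables with the given cardinalities."""
    return prod(cardinalities)


# Hypothetical flat agent: n binary state variables yield 2**n states.
flat_sizes = {n: state_space_size([2] * n) for n in (5, 10, 20)}
print(flat_sizes)  # {5: 32, 10: 1024, 20: 1048576}

# Hypothetical decomposition: 20 variables split into 4 subtasks that each
# track 5 disjoint variables (an assumption; real subtasks may share variables).
decomposed_size = sum(state_space_size([2] * 5) for _ in range(4))
print(decomposed_size)  # 128
```

    Under these assumptions the flat agent must cover over a million states, while the four subtask agents together cover 128, which is the intuition behind dividing an agent into a hierarchy of agents.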

    While reinforcement learning dialogue systems are thus very promising, they still need to overcome several limitations to reach practical and widespread application in the real world. One of these limitations is the fact that user simulations need to be as realistic as possible, so that dialogue policies will not overfit the simulated interactions with poor generalization to real interactions. Another limitation is that attempts to address the curse of dimensionality often involve rule-based reductions of the state-action space [Litman et al. 2000; Singh et al. 2002; Heeman 2007; Williams 2008; Cuayáhuitl et al. 2010; Dethlefs et al. 2011]. This can lead to reduced flexibility of system behavior in terms of not letting the user take initiative in the dialogue to say and/or do anything at any time. Finally, even when function approximation techniques have been used to scale up in small-scale and single-task systems [Henderson et al. 2008; Li et al. 2009; Pietquin 2011; Jurčíček et al. 2011], their application to more complex dialogue contexts has yet to be demonstrated.

    The research question that we address in this article is how to optimize reinforcement learning dialogue systems with multiple subdialogues for flexible human-machine interaction. For example, in the travel planning domain a user may wish to switch back and forth between hotel booking, flight booking and car rental subdialogues. Our motivation to aim for increased dialogue flexibility is the assumption that users at times deviate from the system's expected user behavior. In reduced state spaces this may lead to unseen dialogue states in which the system cannot react properly to the new situation (e.g. asking for the availability of rental cars during a hotel booking subdialogue). Thus, while a full state space represents maximal flexibility (according to the state variables taken into account) but is often not scalable, reducing the state space for increased scalability simultaneously risks reducing dialogue flexibility. Since finding the best state-action space for a learning agent is a daunting task, we suggest in this article that learning agents should optimize subdialogues and allow flexible transitions across them rather than optimizing whole dialogues. This is a novel approach that has not been explored before, but that will increase both dialogue flexibility and scalability. Our approach is couched within a Hierarchical Reinforcement Learning (HRL) framework, a principled and scalable model for optimizing sub-behaviors [Barto and Mahadevan 2003]. We extend an existing HRL algorithm with the following features:

    (1) instead of imposing strict hierarchical dialogue control, we allow users to navigate more flexibly across the available subdialogues (using non-strict hierarchical control); and

    (2) we represent the dialogue policy using function approximation in order to generalize the decision-making even to situations unseen in training.
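
    The two features listed above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm proposed in the article: the subtask names follow the travel planning example, while the transition tables, action names, weights and feature vector are hypothetical, and strict control is simplified to "no switching between sibling subtasks".

```python
def q_value(weights, features):
    """Linear approximation: Q(s, a) is the dot product of action weights and state features."""
    return sum(w * x for w, x in zip(weights, features))


def greedy_action(theta, features, allowed_actions):
    """Return the allowed action with the highest approximated value."""
    return max(allowed_actions, key=lambda a: q_value(theta[a], features))


def next_subtask(current, requested, transitions):
    """Switch to the requested subtask only if the transition table permits it."""
    return requested if requested in transitions.get(current, set()) else current


# Hypothetical transition tables for three travel-planning subtasks.
STRICT = {"hotel": set(), "flight": set(), "car": set()}
NON_STRICT = {
    "hotel": {"flight", "car"},
    "flight": {"hotel", "car"},
    "car": {"hotel", "flight"},
}

# Hypothetical weights (one vector per action) and state features.
theta = {"ask_dates": [0.2, 0.9], "confirm": [0.7, 0.1]}
phi = [1.0, 0.0]  # the state itself need not have been seen in training

print(greedy_action(theta, phi, ["ask_dates", "confirm"]))  # confirm
print(next_subtask("hotel", "car", STRICT))      # hotel
print(next_subtask("hotel", "car", NON_STRICT))  # car
```

    With the strict table, a request for a rental car during hotel booking is ignored; with the non-strict table, the dialogue follows the user. Because action values are computed from features rather than looked up in a table, an action can still be ranked in a state that was never visited.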
