
Journal of Machine Learning Research 19 (2018) 1-45 Submitted 3/18; Revised 10/18; Published 10/18

Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling

Adrian Šošić [email protected]
Abdelhak M. Zoubir [email protected]
Signal Processing Group
Technische Universität Darmstadt
64283 Darmstadt, Germany

Elmar Rueckert [email protected]
Institute for Robotics and Cognitive Systems
University of Lübeck
23538 Lübeck, Germany

Jan Peters [email protected]
Autonomous Systems Labs
Technische Universität Darmstadt
64289 Darmstadt, Germany

Heinz Koeppl [email protected]

Bioinspired Communication Systems

Technische Universität Darmstadt

64283 Darmstadt, Germany

Editor: George Konidaris

Abstract

Advances in the field of inverse reinforcement learning (IRL) have led to sophisticated inference frameworks that relax the original modeling assumption of observing an agent behavior that reflects only a single intention. Instead of learning a global behavioral model, recent IRL methods divide the demonstration data into parts, to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build upon the premise that even a single trajectory can be explained more efficiently locally within a certain context than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent's goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework that significantly outperforms existing IRL solutions and provides smooth policy estimates consistent with the expert's plan. Most notably, our framework naturally handles situations where the intentions of the agent change over time and classical IRL algorithms fail. In addition, due to its probabilistic nature, the model can be straightforwardly applied in active learning scenarios to guide the demonstration process of the expert.

Keywords: Learning from Demonstration, Inverse Reinforcement Learning, Bayesian Nonparametric Modeling, Subgoal Inference, Graphical Models, Gibbs Sampling

© 2018 Adrian Šošić, Elmar Rueckert, Jan Peters, Abdelhak M. Zoubir and Heinz Koeppl.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/18-113.html.


1. Introduction

Inverse reinforcement learning (IRL) refers to the problem of inferring the intention of an agent, called the expert, from observed behavior. Under the Markov decision process (MDP) formalism (Sutton and Barto, 1998), that intention is encoded in the form of a reward function, which provides the agent with instantaneous feedback for each situation encountered during the decision-making process. Classical IRL methods (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008; Ramachandran and Amir, 2007; Levine et al., 2011) assume there exists a single global reward model that explains the entire set of demonstrations provided by the expert. In order to relax this rather restrictive modeling assumption, recent IRL methods allow the agent's intention to change over time (Nguyen et al., 2015), or they presume that the demonstration data set is inherently composed of several parts (Dimitrakakis and Rothkopf, 2011), where different trajectories reflect the intentions of different domain experts.

In this work, we go a step further and start from the premise that, even in the case of a single expert or trajectory, the demonstrated behavior can be explained more efficiently locally (i.e., within a certain context) than by a global reward model. As an illustrative example, we may consider the task shown in Figure 1a, where the expert approaches a set of intermediate target positions before finally heading toward a global goal state. Similarly, in Figure 1b, the agent eventually returns to its initial position, from where the cyclic process repeats. Despite the simplicity of these tasks, the encoding of such behaviors in a global intention model requires a reward structure that comprises a comparably large number of redundant state-action-based rewards. Alternative modeling strategies rely on task-dependent expansions of the agent's state representation, e.g., to memorize the last visited goal (Krishnan et al., 2016), or they resort to more general decision-making frameworks like semi-MDPs/options (Bradtke and Duff, 1994; Sutton et al., 1999) in order to achieve the necessary level of task abstraction.

In this paper, we present a substantially simpler modeling framework that requires only minimal adaptations to the standard MDP formalism but comes with a hypothesis space of behavioral models that is sufficiently large to cover a broad class of expert policies. The key insight that motivates our approach is that many tasks, like those in Figure 1, can be decomposed into smaller subtasks that require considerably less modeling effort. The resulting low-level task descriptions can then be used as building blocks to synthesize arbitrarily complex behavioral strategies through a suitable sequencing of subtasks. This offers the possibility to learn comparably simple task representations using the intuitive concept of subgoals, which is achieved by efficiently encoding the expert behavior using task-adapted partitionings of the system state space/the expert data.

The proposed framework builds upon the method of Bayesian nonparametric inverse reinforcement learning (BNIRL, Michini and How, 2012), which can be used to build a subgoal representation of a task based on demonstration data, albeit without learning the underlying subgoal relationships or providing a policy model that can generalize the strategy of the demonstrator. In order to address this limitation, we generalize the BNIRL model using insights from our previous works on nonparametric subgoal modeling (Šošić et al., 2018a) and policy recognition (Šošić et al., 2018b), building a compact intentional model of the expert's behavior that explicitly describes the local dependencies between the demonstrations and the underlying subgoal structure. The result is an integrated Bayesian prediction framework that exploits the spatio-temporal context of the demonstrations and is capable of producing smooth policy estimates that are consistent with the expert's plan. Furthermore, capturing the full posterior information of the data set enables us to apply the proposed approach in an active learning setting, where the data acquisition process is controlled by the posterior predictive distribution of our model.

(a) sequenced target positions (Michini and How, 2012)

(b) cyclic behavior

Figure 1: Two simple behavior examples that motivate the subgoal principle. The setting is based on the grid world dynamics described in Section 5.1. In both cases, a task description based on a global reward function is inefficient as it requires many state-action-based rewards to explain the observed trajectory structures. However, the data can be described efficiently through subgoal-based encodings. Both scenarios are analyzed in detail in Section 5.

In our experimental study, we compare the proposed approach with common baseline methods on a variety of benchmark tasks and real-world scenarios. The results reveal that our approach performs significantly better than the original BNIRL model and alternative IRL solutions on all considered tasks. Interestingly enough, our algorithm outperforms the baselines even when the expert's true reward structure is dense and the underlying subgoal assumption is violated.

1.1 Related Work

The idea of decomposing complex behavior into smaller parts has been around for a long time and researchers have approached the problem in many different ways. While the overall field of methods is too large to be covered here, most existing approaches can be clearly categorized according to certain criteria. Often, two approaches differ in their exact problem formulation, i.e., we can distinguish between active methods, where the learning algorithm can interact freely with the environment (e.g., hierarchical reinforcement learning, Botvinick, 2012; Al-Emran, 2015), and passive methods, where the behavioral model is trained solely through observation (learning from demonstration, Argall et al., 2009). Furthermore, we can discriminate between methods that build an explicit intentional model of the underlying task (IRL and option-based models, Choi and Kim, 2012; Sutton et al., 1999), and those that work directly on the control/trajectory level (skill learning, movement primitives, Konidaris et al., 2012; Schaal et al., 2005). The latter distinction is sometimes also referred to as intentional/subintentional approaches (Albrecht and Stone, 2017; Panella and Gmytrasiewicz, 2017). In order to give a concise summary of the work that is most relevant to ours, we restrict ourselves to passive approaches, with a focus on intentional methods, the field of which is considerably smaller. For an overview of active approaches, we refer to existing literature, e.g., the work by Daniel et al. (2016a).

First, there is the class of methods that pursue a decomposition of the observed behavior on the global level, using trajectory-based IRL approaches. For example, Dimitrakakis and Rothkopf (2011) proposed a hierarchical prior over reward functions to account for the fact that different trajectories in a data set could reflect different behavioral intentions, e.g., because they were generated by different domain experts. Similarly, Babes-Vroman et al. (2011) follow an expectation-maximization-based clustering approach to group individual trajectories according to their underlying reward functions. Choi and Kim (2012) generalized this idea by proposing a nonparametric Bayesian model in which the number of intentions is a priori unbounded.

While the above methods consider the expert data at a global scale, our work is concerned with the problem of subgoal modeling, which is often conducted in the form of option-based reasoning (Sutton et al., 1999). For instance, Tamassia et al. (2015) proposed a clustering approach based on state distances to find a minimal set of options that can explain the expert behavior. While the method provides a simple alternative to handcrafting options, it does not allow any probabilistic treatment of the data and involves many ad-hoc design choices. Going in the same direction, Daniel et al. (2016a) presented a more principled, probabilistic option framework based on expectation-maximization. Not only is the framework capable of inferring sub-policies automatically, it can also be used in a reinforcement learning context for intra-option learning. However, the resulting behavioral model is based on point estimates of the policy parameters, and the number of sub-policies needs to be specified manually. The latter problem was solved by Krishnan et al. (2016), who proposed a hierarchical nonparametric IRL framework to learn a sequential representation of the demonstrated task, based on a set of transition regions that are defined through local changes in linearity of the observed behavior. However, in contrast to the work by Daniel et al. (2016a), inference is not performed jointly but in several isolated stages where, again, each stage only propagates a point estimate of the associated model parameters. Moreover, the temporal relationship of the demonstration data, used to identify the local linearity changes, is considered only in an ad-hoc fashion with the help of a windowing function.

Another general class of models, which explicitly addresses this issue, employs a hidden Markov model (HMM) structure to establish a temporal relationship between the demonstrations. For instance, the work presented by Nguyen et al. (2015) can be regarded as a generalization of the model by Babes-Vroman et al. (2011), which extends the expectation-maximization framework by imposing a Markov structure on the reward model. Similarly, Niekum et al. (2012) use an extended HMM to segment the demonstrations into vector autoregressive models, in order to learn a suitable set of movement primitives. However, the learning of those primitives is done in a post-processing step, meaning that the quality of the final representation crucially depends on the success of the initial segmentation stage. In contrast, the method by Rueckert et al. (2013) automatically learns the position and timing of subgoals in the form of via-points, but the number of via-points is assumed to be known and the system objective finally gets encoded in the form of a global cost function. Recently, Lioutikov et al. (2017) presented a related approach based on probabilistic movement primitives that jointly solves the segmentation and learning step for an unknown number of primitives, using an expectation-maximization framework. Yet, the model operates purely on the trajectory level and cannot reveal the latent intentions of the demonstrator. Another variant of the approach by Niekum et al. (2012) that explicitly addresses this problem was proposed by Surana and Srivastava (2014). In their paper, the authors propose to replace the HMM emission model with an MDP model, in order to infer a policy model from the segmented trajectories instead of recognizing changes in the dynamics. The model was later extended by Ranchod et al. (2015), who augmented the HMM representation with a beta process model to facilitate skill sharing across trajectories. While the resulting model formulation is highly flexible, its major drawback is that inference becomes computationally expensive as it involves multiple IRL iterations per Gibbs step.

In contrast to the HMM-based solutions, which by their sequential nature focus on the temporal relationship of subtasks, the approach presented in this paper establishes a more general correlation structure between demonstrations by employing non-exchangeable prior distributions over subgoal assignments, i.e., without committing to purely temporal factorizations of subgoals. This results in a compact model representation (e.g., it avoids the need of estimating latent subgoal transition probabilities required in an HMM structure) and adds the flexibility to capture both the temporal and the spatial dependencies between subtasks.

1.2 Paper Outline

The organization of the paper is as follows: in Section 2, we briefly revisit the BNIRL model and discuss its limitations, which forms the basis for our work. Section 3 then introduces a new intentional subgoal framework, which addresses the shortcomings of BNIRL discussed in Section 2. In Section 4, we derive a sampling-based inference scheme for our model and explain how the new framework can be used for subgoal extraction and action prediction. Experimental results on both synthetic and real-world data are presented in Section 5 before we finally conclude our work in Section 6.

2. Bayesian Nonparametric Inverse Reinforcement Learning

The purpose of this section is to recapitulate the principle of Bayesian nonparametric inverse reinforcement learning. After briefly discussing all building blocks of the model, we focus on the limitations of the framework, which motivates the need for an extended model formulation and finally leads to a new inference approach, presented afterwards in Section 3.

2.1 Revisiting the BNIRL Framework

Following the common IRL paradigm (Ng and Russell, 2000; Zhifei and Joo, 2012), the goal of BNIRL is to infer the intentions of an agent based on demonstration data. Starting from a standard MDP model, the problem is formalized on a finite state space $S$, assuming a time-invariant state transition model $T : S \times S \times A \to [0, 1]$, where $A$ is a finite set of actions available to the agent at each state. For notational convenience, we represent the states in $S$ by the integer values $\{1, \dots, |S|\}$, where $|S|$ denotes the cardinality of the state space.

In BNIRL, it is assumed that we can observe a number of expert demonstrations provided in the form of state-action pairs, $\mathcal{D} := \{(s_d, a_d)\}_{d=1}^{D}$, where each pair $(s_d, a_d) \in S \times A$ consists of a state $s_d$ visited by the agent and the corresponding action $a_d$ taken. Herein, $D$ denotes the size of the demonstration set. Throughout the rest of this paper, we will use the shorthand notations $s := \{s_d\}_{d=1}^{D}$ and $a := \{a_d\}_{d=1}^{D}$ to access the collections of expert states and actions individually. Note that the BNIRL model makes no assumptions about the temporal ordering of the demonstrations, i.e., each state-action pair is considered to have arisen from a specific but arbitrary time instant of the agent's decision-making process. We will come back to this point later in Sections 2.2 and 3.3.

In contrast to the classical MDP formalism and most other IRL frameworks, BNIRL does not presuppose that the observed expert behavior necessarily originates from a single underlying reward function. Instead, it introduces the concept of subgoals (and corresponding subgoal assignments) with the underlying assumption that, at each decision instant, the expert selects a particular subgoal to plan the next action. Each subgoal is herein represented by a certain reward function defined on the system state space; in the simplest case, it corresponds to a single reward mass placed at a particular goal state in $S$, which we identify with a reward function $R_g : S \to \{0, C\}$ of the form

$$R_g(s) := \begin{cases} C & \text{if } g = s, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where $g \in \{1, \dots, |S|\}$ indicates the subgoal location and $C \in (0, \infty)$ is some positive constant (compare Simsek et al., 2005; Stolle and Precup, 2002; Tamassia et al., 2015).
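As a minimal illustration of this reward class (a sketch only; the 25-state grid size, the subgoal index, and C = 1 are hypothetical example values, and states are 0-indexed here for convenience):

```python
import numpy as np

def subgoal_reward(g: int, num_states: int, C: float = 1.0) -> np.ndarray:
    """Reward vector R_g of Equation (1): reward C at the subgoal state g, zero elsewhere."""
    R = np.zeros(num_states)
    R[g] = C
    return R

# Example: a 25-state grid world with a single subgoal at state 12.
R_g = subgoal_reward(g=12, num_states=25, C=1.0)
```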

Although in principle it is legitimate to associate each subgoal with an arbitrary reward structure to encode more complex forms of goal-oriented behavior (see, for example, Ranchod et al., 2015), the restriction to the reward function class in Equation (1) is sufficient in the sense that the same behavioral complexity can be synthesized through a combination of subgoals. This is made possible by the nonparametric nature of BNIRL, i.e., because the number of possible subgoals is assumed to be unbounded. The use of the reward model in Equation (1) has the advantage, however, that posterior inference about the expert's subgoals becomes computationally tractable, as will be explained in Section 4.5. In the following, we therefore focus on the above reward model and summarize the infinite collection of subgoals in the multiset $G := \{g_k\}_{k=1}^{\infty} \in \bigtimes_{k=1}^{\infty} S$, where we adopt the assumption that $p(G \mid s) = \prod_{k=1}^{\infty} p_g(g_k \mid s)$.¹

The subgoal assignment in BNIRL is achieved using a set of indicator variables $z := \{z_d \in \mathbb{N}\}_{d=1}^{D}$, which annotate each demonstration pair $(s_d, a_d)$ with its unique subgoal index. The prior distribution $p(z)$ is modeled by a Chinese restaurant process (CRP, Aldous, 1985), which assigns the event that indicator $z_d$ points to the $j$th subgoal the prior probability

$$p(z_d = j \mid z_{\setminus d}) \propto \begin{cases} n_j & \text{if } j \in \{1, \dots, K\}, \\ \alpha & \text{if } j = K + 1, \end{cases}$$

where $z_{\setminus d} := z \setminus \{z_d\}$ is a shorthand notation for the collection of all indicator variables except $z_d$. Further, $n_j$ denotes the number of assignments to the $j$th subgoal in $z_{\setminus d}$, $K$ represents the number of distinct entries in $z_{\setminus d}$, and $\alpha \in [0, \infty)$ is a parameter controlling the diversity of assignments.

1. Notice that the subgoal prior distribution in the original BNIRL formulation does not take the state variable $s$ as an argument. Nonetheless, the authors of BNIRL suggest restricting the support of the distribution to the set of visited states, which indeed implies a conditioning on $s$.
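For intuition, the following sketch samples subgoal indicators sequentially from the CRP prior just described; the concentration parameter and the number of demonstrations are hypothetical example values.

```python
import numpy as np

def sample_crp(D: int, alpha: float, seed: int = 0) -> np.ndarray:
    """Draw subgoal indicators z_1, ..., z_D from a Chinese restaurant process prior.

    Each new indicator joins an existing subgoal j with probability proportional to
    its current count n_j, or opens a new subgoal with probability proportional to alpha.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros(D, dtype=int)
    counts = []                          # n_j for the subgoals created so far
    for d in range(D):
        weights = np.array(counts + [alpha], dtype=float)
        j = rng.choice(len(weights), p=weights / weights.sum())
        if j == len(counts):             # open a new subgoal
            counts.append(1)
        else:
            counts[j] += 1
        z[d] = j
    return z

z = sample_crp(D=20, alpha=1.5)          # e.g., assignments for 20 state-action pairs
```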

Having targeted a particular subgoal $g_{z_d}$ while being at some state $s_d$, the expert is assumed to choose the next action $a_d$ according to a softmax decision rule, $\pi : A \times S \times S \to [0, 1]$, which weighs the expected returns of all actions against one another,

$$\pi(a_d \mid s_d, g_{z_d}) := \frac{\exp\{\beta Q^*(s_d, a_d \mid g_{z_d})\}}{\sum_{a \in A} \exp\{\beta Q^*(s_d, a \mid g_{z_d})\}}. \tag{2}$$

Herein, $Q^*(s, a \mid g)$ denotes the state-action value (or Q-value, Sutton and Barto, 1998) of action $a$ at state $s$ under an optimal policy for the subgoal reward function $R_g$,

$$Q^*(s, a \mid g) := \max_{\pi} \; \mathbb{E}\left[ \sum_{n=0}^{\infty} \gamma^n R_g(s_{t=n}) \;\Big|\; s_{t=0} = s,\, a_{t=0} = a,\, \pi \right], \tag{3}$$

where the expectation is with respect to the stochastic state-action sequence induced by the fixed policy $\pi : S \to A$, with initial action $a$ executed at the starting state $s$. The explicit notation $s_{t=n}$ and $a_{t=n}$ is used to disambiguate the temporal index of the decision-making process from the demonstration index of the state-action pairs $\{(s_d, a_d)\}$.
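To make Equations (2) and (3) concrete, here is a small sketch that computes Q* for a single subgoal by standard value iteration and then forms the softmax action likelihood. The five-state chain, the discount factor, and β are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def q_values(T: np.ndarray, R: np.ndarray, gamma: float = 0.95, n_iter: int = 500) -> np.ndarray:
    """Value iteration for Q*(s, a | g) of Equation (3).

    T has shape (|S|, |A|, |S|) with T[s, a, s'] = p(s' | s, a);
    R is the state-based subgoal reward R_g of Equation (1).
    """
    Q = np.zeros(T.shape[:2])
    for _ in range(n_iter):
        V = Q.max(axis=1)                        # V*(s | g)
        Q = R[:, None] + gamma * T @ V           # Bellman update
    return Q

def softmax_policy(Q: np.ndarray, beta: float) -> np.ndarray:
    """Softmax likelihood pi(a | s, g) of Equation (2), normalized row-wise over actions."""
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stabilization
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Tiny 5-state chain with actions {left, right}; subgoal reward C = 1 at state 4.
S, A = 5, 2
T = np.zeros((S, A, S))
for s in range(S):
    T[s, 0, max(s - 1, 0)] = 1.0                 # action 0: move left
    T[s, 1, min(s + 1, S - 1)] = 1.0             # action 1: move right
R_g = np.zeros(S)
R_g[4] = 1.0
pi = softmax_policy(q_values(T, R_g), beta=5.0)  # pi[s, a] approximates Equation (2)
```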

The softmax policy π models the expert's (in-)ability to maximize the future expected return in view of the targeted subgoal, while the coefficient $\beta \in [0, \infty)$ is used to express the expert's level of confidence in the optimal action. Combined with the subgoal prior distribution $p_g$ and the partitioning model $p(z)$, we obtain the joint distribution of all demonstrated actions $a$, subgoals $G$, and subgoal assignments $z$ as

$$p(a, z, G \mid s) = p(z) \prod_{k=1}^{\infty} p_g(g_k \mid s) \prod_{d=1}^{D} \pi(a_d \mid s_d, g_{z_d}). \tag{4}$$

The structure of this distribution is visualized in the form of a Bayesian network in Figure 2a. It is worth emphasizing that π, although referred to as the likelihood model for the state-action pairs in the original BNIRL paper, is really just a model for the actions conditional on the states. In contrast to what is stated in the original paper, the distribution in Equation (4) therefore takes the form of a conditional distribution (i.e., conditional on $s$), which does not provide any generative model for the state variables.

Posterior inference in BNIRL refers to the (approximate) computation of the conditional distribution $p(z, G \mid \mathcal{D})$, which makes it possible to identify potential subgoal locations and the corresponding subgoal assignments based on the available demonstration data. For further details, the reader is referred to the original paper (Michini and How, 2012).


[Figure 2 shows four Bayesian networks: (a) BNIRL model, (b) intermediate model, (c) ddBNIRL-S model, (d) ddBNIRL-T model.]

Figure 2: Relationships between all discussed subgoal models, illustrated in the form of Bayesian networks. Shaded nodes represent observed variables; deterministic dependencies are highlighted using double strokes.

2.2 Limitations of BNIRL

Subgoal-based inference is a well-motivated approach to IRL and the BNIRL framework has shown promising results in a variety of real-world scenarios. Yet, the model formulation by Michini and How (2012) comes with a number of significant conceptual limitations, which we explain in detail in the following paragraphs.

Limitation 1: Subgoal Exchangeability and Posterior Predictive Policy

The central limitation of BNIRL is that the framework is restricted to pure subgoal extraction and does not inherently provide a reasonable mechanism to generalize the expert behavior based on the inferred subgoals. The reason lies in the particular design of the framework, which, at its heart, treats the subgoal assignments $z$ as exchangeable random variables (Aldous, 1985). By implication, the induced partitioning model $p(z)$ is agnostic about the covariate information contained in the data set and the resulting behavioral model is unable to propagate the expert knowledge to new situations.

To illustrate the problem, let us investigate the predictive action distribution that arises from the original BNIRL formulation. For simplicity and without loss of generality, we may assume that we have perfectly inferred all subgoals $G$ and corresponding subgoal assignments $z$ from the demonstration set $\mathcal{D}$. Denoting by $a^* \in A$ the predicted action at some new state $s^* \in S$, the BNIRL model yields

$$\begin{aligned}
p(a^* \mid s^*, \mathcal{D}, z, G) &= \sum_{z^* \in \mathbb{N}} p(a^*, z^* \mid s^*, \mathcal{D}, z, G) \\
&= \sum_{z^* \in \mathbb{N}} p(a^* \mid z^*, s^*, \mathcal{D}, z, G)\, p(z^* \mid s^*, \mathcal{D}, z, G) \\
&\overset{(\star)}{=} \sum_{z^* \in \mathbb{N}} p(a^* \mid z^*, s^*, g_{z^*})\, p(z^* \mid z),
\end{aligned} \tag{5}$$

where $z^* \in \mathbb{N}$ is the latent subgoal index belonging to $s^*$. Note that $p(a^* \mid z^*, s^*, g_{z^*})$ can either represent the softmax decision rule $\pi(a^* \mid s^*, g_{z^*})$ from Equation (2) or an optimal (deterministic) policy for subgoal $g_{z^*}$, depending on whether we aspire to describe the noisy expert behavior at $s^*$ or want to determine an optimal action according to the inferred reward model. The last equality in Equation (5), indicated by $(\star)$, follows from the conditional independence properties implied by Equation (4), which can be easily verified using d-separation (Koller and Friedman, 2009) on the graphical model in Figure 2a.

As Equation (5) reveals, the predictive model is characterized by the posterior distribution $p(z^* \mid s^*, \mathcal{D}, z, G)$ of the latent subgoal assignment $z^*$ of state $s^*$: the intuition being that, in order to generalize the expert's plan to a new situation, we need to take into account the gathered information about what would be a likely subgoal targeted by the expert at $s^*$. However, in BNIRL, the distribution $p(z^* \mid s^*, \mathcal{D}, z, G)$ is modeled without consideration of the query state $s^*$, or any other observed variable. By conditional independence (Equation 4), the distribution effectively reduces, as indicated by $(\star)$, to the CRP prior $p(z^* \mid z)$, which, due to its intrinsic exchangeability property, only considers the subgoal frequencies of the readily inferred assignments $z$. Clearly, a subgoal assignment mechanism based solely on frequency information is of little use when it comes to predicting the expert behavior as it will inevitably ignore the structural information contained in the demonstration set and always return the same subgoal probabilities at all query states, regardless of the agent's actual situation. By contrast, a reasonable assignment mechanism should inherently take into account the context of the agent's current state $s^*$ when deciding about the next action.

While the authors of BNIRL discuss the action selection problem in their paper and propose an assignment strategy for new states based on action marginalization, their approach does not provide a satisfactory solution to the problem because the alleged conditioning on the query state (see Equation 19 in the original paper, Michini and How, 2012) has no effect on the involved subgoal indicator variable, as shown by Equation (5) above. The only way to remedy the problem without modifying the model is to use an external post-processing scheme like the waypoint method, discussed in the next section.

Limitation 2: Spatial and Temporal Context

The waypoint method, described at full length in a follow-up paper by Michini et al. (2015), is a post-processing routine to convert the subgoals identified through BNIRL into a valid option model (Sutton et al., 1999). The obtained model reconstructs the high-level plan of the demonstrator by sequencing the inferred subgoals in a way that complies with the spatio-temporal relationships of the expert's decisions as observed during the demonstration phase. To this end, the required initiation and termination sets of the option-policies are constructed by considering the state distances to the identified subgoals as well as their temporal ordering prescribed by the expert.

[Figure 3 panels: imperfect demonstrations, bag-of-words clustering, noisy trajectory labeling.]

Figure 3: A diagram to illustrate the implications of the exchangeability assumption in BNIRL. Similar to a bag-of-words model (Blei et al., 2003; Yang et al., 2007), the BNIRL partitioning mechanism ignores the spatio-temporal context of the data, which makes it difficult to discriminate demonstration noise from a real change of the agent's intentions. Note that the diagram illustrates the partitioning process in a simplified way as it only shows the effect of the prior $p(z)$ but neglects the impact of the likelihood model π. While the latter does indeed consider the state context of the actions, it cannot account for spatial or temporal patterns in the data as it processes all state-action pairs separately.

When combined with BNIRL, this method makes it possible to synthesize a behavioral model that mimics the observed expert behavior. However, the strategy comes with a number of significant drawbacks:

(i) Using the waypoint method, the spatio-temporal relationships between the individual demonstrations are explored only in a post-hoc fashion and are largely ignored during the actual inference procedure (the state information enters via the likelihood model π but is not considered by the partitioning model $p(z)$, as explained in Limitation 1). This lack of context-awareness makes the inference mechanism overly prone to demonstration noise (see Figure 3 and results in Section 5).

(ii) Measuring proximities to subgoals in order to determine the right visitation order requires some form of distance metric defined on the state space. If the system states correspond to physical locations, constructing such a metric is usually straightforward. However, in the general case where states encode arbitrary abstract information (see example in Section 5.2), it can become difficult to design that metric by hand. Unfortunately, the BNIRL framework does not provide any solution to this problem.

(iii) The waypoint method cannot be applied to multiple unaligned trajectories (e.g., obtained from different experts) or in cases where the data set does not carry any temporal information. This situation occurs, for instance, when the expert data is provided as separate state-action pairs with unknown timestamps and not given in the form of coherent trajectories (see again the example in Section 5.2).


(iv) Assigning a particular visitation order to the inferred subgoals is meaningful only if the expert eventually reaches those subgoals during the demonstration phase (or if, at least, the subgoals lie "close" to the visited states in terms of the aforementioned distance metric). Finding subgoals with such properties can be guaranteed by constraining the support of the subgoal prior distribution $p_g$ to states that are near to the expert data (see the footnote in Section 2.1) but this reduces the flexibility of the model and potentially disables compact encodings of the task (Figure 4).

Limitation 3: Inconsistency under Time-Invariance

Reasoning about the intentions of an agent, there are two basic types of behavior one may encounter:

• either the agent follows a static strategy to optimize a fixed objective (as assumed in the standard MDP formalism, Sutton and Barto, 1998), or

• the intentions of the agent change over time.

The latter is clearly the more general case but also poses a more difficult inference problem in that it requires us both to identify the intentions of the agent and to understand their temporal relationship. The static scenario, in contrast, implies that there exists an optimal policy for the task in the form of a simple state-to-action mapping $\pi : S \to A$ (Puterman, 1994), which from the very beginning imprints a specific structure on the inference problem.

The BNIRL model generally falls into the second category since it freely allocates its subgoals per decision instant and not per state, allowing a flexible change of the agent's objective. Yet, it is important to understand that the model does not actually distinguish between the two described scenarios. As explained in Limitation 2, the temporal aspect of the data is not explicitly modeled by the BNIRL framework, even though the waypoint method subsequently tries to capture the overall chronological order of events. As a consequence, the model is not tailored to either of the two scenarios: on the one hand, it ignores the valuable temporal context that is needed in the time-varying case to reliably discriminate demonstration noise from a real change of the agent's intention. On the other hand, the model is agnostic about the predefined time-invariant nature of the optimal policy in the static scenario. This lack of structure not only makes the inference problem harder than necessary in both cases; it also allows the model to learn inconsistent data representations in the static case since the same state can be potentially assigned to more than one subgoal, violating the above-mentioned state-to-action rule (Figure 5).

Limitation 4: Subgoal Likelihood Model

Apart from the discussed limitations of the BNIRL partitioning model, it turns out that there are two problematic issues concerning the softmax likelihood model in Equation (2). On the following pages, we demonstrate that the specific form of the model encodes a number of properties that are indeed contradictory to our intuitive understanding of subgoals. While these properties are less critical for the final prediction of the expert behavior, it turns out that they drastically affect the localization of subgoals. Since the cause of these effects is somewhat hidden in the model equation, we defer the detailed explanation to Section 3.1.


[Figure 4 panels: demonstrations, local search, global search; legend: trajectory, subgoals, global goal.]

Figure 4: Difference between local (constrained) and global (unconstrained) subgoal search. The top and the bottom row depict two different sets of demonstration data (solid lines), together with potential goal/subgoal locations (crosses/circles) that explain the observed behavior. Color indicates the corresponding subgoal assignment of each trajectory segment. Top: two trajectories approaching the same goal. Bottom: the agent is heading toward a global goal, gets temporarily distracted, and then follows up on its original plan. Left: observed trajectories. Center: example partitioning under the assumption that the expert reached all subgoals during the demonstration. Right: example partitioning without restriction on the subgoal locations, yielding a more compact encoding of the task.

(a) time-varying intentions (b) time-invariant intentions

Figure 5: Schematic comparison of the two basic behavior types, illustrated using two different agent trajectories. Color indicates the temporal progress. (a) Time-varying intentions may cause the agent to perform a different action when revisiting a state (dotted circle). (b) By contrast, time-invariant intentions imply a simple state-to-action policy: the agent has no incentive to perform a different action at an already visited state since, by definition, the underlying objective has remained unchanged. Diverging actions, as observed at the crossing point in the left subfigure, can therefore only be explained as a result of suboptimal behavior.


Limitation 5: State-Action Demonstrations

Lastly, a minor problem of the original BNIRL framework is that the inference algorithm expects the demonstration data to be provided in the form of state-action pairs, which requires full access to the expert's action record. This assumption is restrictive from a practical point of view as it confines the application of the model to settings with laboratory-like conditions that allow a complete monitoring of the expert. For this reason, it is important to note that an estimate of the expert's action sequence can be recovered through BNIRL with the help of an additional sampling stage (omitted in the original paper), provided that we know the successor state reached by the expert after each decision. For the marginalized inference scheme described in this paper, we present the corresponding sampling stage in Section 4.4.

3. Nonparametric Spatio-Temporal Subgoal Modeling

In this section, we introduce a redesigned inference framework, which, in analogy to BNIRL, we refer to as distance-dependent Bayesian nonparametric IRL (ddBNIRL). We derive the model by making a series of modifications to the original BNIRL framework that address the previously described shortcomings on the conceptual level. Rethinking each part of the original framework, we begin with a discussion of the commonly used softmax action selection strategy (Equation 2) in the context of subgoal inference, which finally leads to a redesign of the subgoal likelihood model (Limitation 4). Next, we focus on the subgoal allocation mechanism itself and introduce two closely related model formulations, each targeting one of the basic behavior types described in Figure 5, thereby addressing Limitations 1, 2 and 3. For the time-invariant case, we begin with an intermediate model that introduces a subtle yet important structural modification to the BNIRL framework. In a second step, we generalize that new model to account for the spatial structure of the control problem, which finally allows us to extrapolate the expert behavior to unseen situations. As part of this generalization, we present a new state space metric that arises naturally in the context of subgoal inference (see Limitation 2, second point). Lastly, we tackle the time-varying case and present a variant of the model that explicitly considers the temporal aspect of the subgoal problem. A solution to Limitation 5 is discussed later in Section 4.

In contrast to BNIRL, both presented models can be used for subgoal extraction as well as action prediction. Moreover, sticking with the Bayesian methodology, the presented approach provides complete posterior information at all levels.

3.1 The Subgoal Likelihood Model

Like many other approaches found in the (I)RL literature, BNIRL exploits a softmax weighting (Equation 2) to transform the Q-values of an optimal policy into a valid subgoal likelihood model. The softmax action rule has its origin in RL where it is known as the Boltzmann exploration strategy (Cesa-Bianchi et al., 2017; Sutton and Barto, 1998), which is commonly applied to cope with the exploration-exploitation dilemma (Ghavamzadeh et al., 2015). In recent years, however, it has also become the de facto standard for describing the (imperfect) decision-making strategy of an observed demonstrator (see, for example, Dimitrakakis and Rothkopf, 2011; Ramachandran and Amir, 2007; Rothkopf and Dimitrakakis, 2011; Choi and Kim, 2012; Neu and Szepesvari, 2007; Babes-Vroman et al., 2011).


In the following paragraphs, we focus on the implications of this model for the subgoal extraction problem and show that it contradicts our intuitive understanding of what characteristics a reasonable subgoal model should have. In particular, we argue that the subgoal posterior distribution arising from the BNIRL softmax model is of limited use for inferring the latent intention of the agent, due to subgoal artifacts caused by the system dynamics that cannot be reconciled with the evidence provided by the demonstrations. Based on these insights, we propose an alternative transformation scheme that is more consistent with the subgoal principle.

3.1.1 Scale of the Reward Function

The first implication of the softmax likelihood model concerns the choice of the uncertainty coefficient β. To explain the problem, we consider the thought experiment of an agent located at some state $s$ targeting a particular subgoal $g$. The likelihood $\pi(a \mid s, g)$ in Equation (2) quantifies the probability that the agent decides on a specific action $a$, based on the corresponding state-action values $Q^*(s, \cdot \mid g)$. Since those values are linear in the underlying reward function $R_g$ (Equation 3), the softmax likelihood model implies that the expert's ability to maximize the long-term reward, reflected by the spread of the probability mass in $\pi(\cdot \mid s, g)$, rises with the magnitude $C$ of the assumed subgoal reward (more concentrated probability mass signifies a higher confidence in the action choice). In other words, assuming a higher goal reward virtually increases our level of confidence in the expert, even though the difficulty of the underlying task and the optimal policy remain unchanged. Nonetheless, the BNIRL model requires us to readjust the uncertainty coefficient β in order to keep both models consistent. However, as the model provides no reference level for the expert's uncertainty across different scenarios, the choice of β becomes nontrivial. Yet, the parameter has a significant impact on the granularity of the learned subgoal model as it trades off purposeful goal-oriented behavior against random decisions.

Note that the described effect is not specific to the subgoal reward model in Equation (1) but is really a consequence of the softmax transformation in Equation (2). In fact, the same problem occurs when the model is applied in a regular MDP environment with an arbitrary reward function, for example, when the agent is provided an additional constant reward at all states. Clearly, such a constant reward provides no further information about the underlying task and should hence not affect the agent's belief about the optimal choice of actions (compare the discussion on constant reward functions and transformations of rewards, Ng and Russell, 2000; Ng et al., 1999). Based on these two observations, our intuition tells us that we seek a rationality model that is invariant to affine transformations of the reward signal, meaning that any two reward functions $R : S \to \mathbb{R}$ and $\tilde{R} := xR + y$ with $x \in (0, \infty)$, $y \in \mathbb{R}$, should give rise to the same intentional representation. As we shall see in Section 3.1.3, this can be achieved by modeling the behavior of an agent based on the relative advantages of actions rather than on their absolute expected returns.

3.1.2 Impact of the Transition Dynamics

The second implication of the softmax likelihood model is less immediate and inherently tied to the dynamics of the system. To explain the problem, we consider a scenario where we have a precise idea about the potential goals of the expert. For our example, we adopt the grid world dynamics described in Section 5.1 and consider a simple upward-directed trajectory of state-action pairs, which we aspire to explain using a single (sub-)goal. The complete setting is depicted in Figure 6.

Intuitively, the shown demonstration set should lead to goals that are located in the upper region of the state space and concentrated around the vertical center line. Moreover, as we move away from that center line, we expect to observe a smooth decrease in the subgoal likelihood, while the rate of the decay should reflect our assumed level of confidence in the expert. As it turns out, the induced BNIRL subgoal posterior distribution, shown in the top row for different values of β, contradicts this intuition. In particular, we observe that the model yields unreasonably high posterior values at the upper border states and corners of the state space, which, according to our intuitive understanding of the problem, cannot be justified by the given demonstration set.

To pin down the cause of this effect, we recall from Equation (2) that the likelihood of an action grows with the corresponding Q-value. Hence, we need to ask what causes the Q-values of the demonstrated actions to be large when the subgoal is assumed to be located at one of the upper corner/border states of the space. Using Bellman's principle, we can express the optimal Q-function for any subgoal $g$ as

$$\begin{aligned}
Q^*(s, a \mid g) &= R_g(s) + \gamma\, \mathbb{E}_T\big[ V^*(s' \mid g) \mid s, a \big] \\
&= R_g(s) + \gamma\, \mathbb{E}_T\big[ \mathbb{E}_{\rho_{\pi_g}}[ R_g(s'') \mid s' ] \mid s, a \big] \\
&= R_g(s) + \gamma\, \mathbb{E}_T\big[ C \rho_{\pi_g}(g \mid s') \mid s, a \big],
\end{aligned} \tag{6}$$

where $V^*(s \mid g) := \max_{a \in A} Q^*(s, a \mid g)$, $\pi_g(s) := \arg\max_{a \in A} Q^*(s, a \mid g)$ is the optimal policy for subgoal $g$, and $C$ is the subgoal reward from Equation (1). Lastly, $\rho_{\pi_g}(s' \mid s) := \sum_{t=0}^{\infty} \gamma^t p_t(s' \mid s, \pi_g)$ denotes the (improper) discounted state distribution generated by executing policy $\pi_g$ from the considered initial state $s$, where $p_t(s' \mid s, \pi_g)$ refers to the probability of reaching state $s'$ from state $s$ under policy $\pi_g$ after exactly $t$ steps, which is defined implicitly via the transition model $T$.

The outer expectation in Equation (6) accounts for the stochastic transition to the successor state $s'$, while the inner expectation evaluates the expected cumulative reward over all states $s''$ that are reachable from $s'$. It is important to note that, by the construction of the Q-function, only the first move of the agent to state $s'$ depends on the choice of action $a$ whereas all remaining moves (i.e., the argument of the expectation in the last line) are purely determined by the system dynamics and the subgoal policy $\pi_g$. Focusing on that inner part, we conclude that, regardless of the chosen action $a$, the Q-values will be large whenever the assumed subgoal induces a high state visitation frequency $\rho_{\pi_g}$ at its own location $g$. The latter is fulfilled if

(i) the chance of reaching the goal in a small number of steps is high so that the effect of discounting is small and/or

(ii) the controlled transition dynamics $T(s' \mid s, \pi_g(s))$ that are induced by the subgoal lead to a high chance of hitting the goal frequently.

Note that the first condition implies that the model generally prefers subgoals that are close to the demonstration set, a property that cannot be justified in all cases. For example, the recording of the demonstrations could have simply ended before the expert was able to reach the goal (Figure 5). Yet, if desired, this proximity property should be more naturally attributed to the subgoal prior model $p_g(g \mid s)$.

Moreover, we observe that the second condition depends primarily on the system dynamics $T$, which can be more or less strongly influenced by the actions of the agent, depending on the scenario. In fact, in a pathological example, $T$ could even be independent of the agent's decisions, meaning that the agent has no control over its state. An example illustrating this extreme case would be a scenario where the agent always gets driven to the same terminal state, regardless of the executed policy. Although it is somewhat pointless to speak of "subgoals" in this context, that terminal state would exhibit a high subgoal likelihood according to the softmax model because the corresponding visitation frequency would be inevitably large. A softened variant of this condition can occur at corner/border states (i.e., states in which the agent experiences fewer degrees of freedom and which are hence more difficult to leave than others) and transition states (i.e., states that must be passed in order to get from certain regions of the space to others), which naturally exhibit an increased visitation frequency due to the characteristics of the environment.

In our example in Figure 6, we can observe the symptoms of both described conditions clearly. In particular, for an upward-directed policy as it is implied by the shown demonstration set, the induced state visitation distribution exhibits increased values at exactly the aforementioned border and corner states (due to the reflections occurring to the agent when hitting the state space boundary) as well as close to the trajectory ending (caused by the proximity condition).
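For a finite state space, the discounted state distribution ρ_{π_g} from Equation (6) can be evaluated in closed form as a Neumann series, and the sketch below does so for a hypothetical four-state chain whose policy pushes the agent toward a reflecting border state, illustrating how such states accumulate visitation mass; all numbers are illustrative.

```python
import numpy as np

def discounted_visitation(P_pi: np.ndarray, gamma: float = 0.95) -> np.ndarray:
    """Improper discounted state distribution rho_{pi_g}(s' | s) from Equation (6).

    P_pi[s, s'] is the controlled transition matrix T(s' | s, pi_g(s)); the Neumann
    series sum_t gamma^t P_pi^t equals the matrix inverse (I - gamma * P_pi)^{-1}.
    """
    return np.linalg.inv(np.eye(P_pi.shape[0]) - gamma * P_pi)

# Hypothetical 4-state chain whose policy always moves the agent to the right;
# state 3 acts like a border state that the agent keeps bumping into.
P_pi = np.array([[0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0, 1.0]])
rho = discounted_visitation(P_pi)
print(rho[0])   # most of the discounted mass concentrates on the border state 3
```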

3.1.3 The Normalized Likelihood Model

To address these problems, we modify the likelihood model using a rescaling of the involved Q-values. Let $Q^{\wedge}(s \mid g)$ and $Q^{\vee}(s \mid g)$ denote the maximum and minimum Q-values at state $s$ for subgoal $g$, i.e., $Q^{\wedge}(s \mid g) := \max_{a \in A} Q^*(s, a \mid g)$ and $Q^{\vee}(s \mid g) := \min_{a \in A} Q^*(s, a \mid g)$. We then define the normalized state-action value function $Q^{\bullet} : S \times A \times S \to [0, 1]$ as

$$Q^{\bullet}(s, a \mid g) := \begin{cases} \dfrac{Q^*(s, a \mid g) - Q^{\vee}(s \mid g)}{Q^{\wedge}(s \mid g) - Q^{\vee}(s \mid g)} & \text{if } Q^{\wedge}(s \mid g) \neq Q^{\vee}(s \mid g), \\[2ex] \varepsilon & \text{otherwise,} \end{cases} \tag{7}$$

where $\varepsilon \in (0, 1]$ is an arbitrary constant that is canceled out in Equation (8). In contrast to the Bellman state-action value function $Q^*$, which quantifies the expected return of an action, the normalized function $Q^{\bullet}$ assesses the return of that action in relation to the returns of all other actions. This concept is similar to that of the advantage function (Baird, 1993) with the important difference that the values returned by $Q^{\bullet}$ are normalized to the range [0, 1] and thus serve as an indicator for the relative quality of actions. Accordingly, the values can be interpreted as relative advantages (i.e., relative to the maximum possible advantage among all actions). The normalized subgoal likelihood model is then constructed analogously to the BNIRL likelihood model,

$$\pi^{\bullet}(a_d \mid s_d, g_{z_d}) \propto \exp\big\{\beta Q^{\bullet}(s_d, a_d \mid g_{z_d})\big\}. \tag{8}$$
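A direct transcription of Equations (7) and (8) as a sketch; the Q-values are assumed to be given as an array over states and actions, and the values of ε and β are arbitrary illustrative choices.

```python
import numpy as np

def normalized_q(Q: np.ndarray, eps: float = 0.5) -> np.ndarray:
    """Relative advantages Q_bullet of Equation (7), computed row-wise over actions."""
    Q_max = Q.max(axis=1, keepdims=True)        # maximum Q-value per state
    Q_min = Q.min(axis=1, keepdims=True)        # minimum Q-value per state
    span = Q_max - Q_min
    Qn = np.full_like(Q, eps)                   # states where all actions are equally good
    nz = span[:, 0] > 0
    Qn[nz] = (Q[nz] - Q_min[nz]) / span[nz]
    return Qn

def normalized_likelihood(Q: np.ndarray, beta: float) -> np.ndarray:
    """Subgoal likelihood pi_bullet of Equation (8), normalized over actions."""
    logits = beta * normalized_q(Q)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

# Two example states with three actions each; the second state is a tie among actions.
Q = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.5, 0.5]])
print(normalized_likelihood(Q, beta=np.log(10.0)))
```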

The key property of this model is that it is invariant to affine transformations of the reward function, as summarized by the following proposition.


[Figure 6 panels: subgoal posteriors for β = 0.1, β = 1, and β = 10 under the BNIRL model (top) and the normalized model (bottom), plus a second scenario with wall states shown for β = 0.1 under both models; color encodes low to high posterior probability.]

Figure 6: Comparison of the subgoal posterior distributions induced by the original BNIRL likelihood model and by the proposed normalized model, based on the grid world dynamics described in Section 5.1 and a uniform subgoal prior distribution $p_g$. The range of the shown color scheme is to be understood per subfigure. Black squares indicate wall states. The BNIRL likelihood model yields unreasonably high subgoal posterior mass at the border states and corners of the state space (due to locally increased state visitation probabilities arising from wall reflections) as well as at trajectory endings (caused by the implicit proximity property of the model); see Section 3.1.2 for details. Both effects are mitigated by the proposed normalized likelihood model, which describes the action-selection process of the agent using relative advantages of actions instead of absolute returns.


Proposition 1 (Affine Invariance) Consider an MDP with reward function $R : S \to \mathbb{R}$ and let $Q^*(s, a \mid R)$ denote the corresponding optimal state-action value function. For the corresponding normalized function $Q^{\bullet}$ it holds that $Q^{\bullet}(s, a \mid R) = Q^{\bullet}(s, a \mid xR + y)$ for all $x \in (0, \infty)$, $y \in \mathbb{R}$, $s \in S$, $a \in A$. Hence, the subgoal likelihood model in Equation (8) is invariant to affine transformations of $R$.

Proof Due to the linear dependence of $Q^*$ on the reward function $R$ (Equation 3) it holds that $Q^*(s, a \mid xR + y) = x\, Q^*(s, a \mid R) + \frac{y}{1 - \gamma}$. Using this relationship in Equation (7), it follows immediately that $Q^{\bullet}(s, a \mid R) = Q^{\bullet}(s, a \mid xR + y)$.
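The linearity step used in the proof can also be checked numerically on a randomly generated MDP; the sketch below does this under assumed values for the state/action counts, the discount factor, and the affine coefficients x and y.

```python
import numpy as np

def q_star(T: np.ndarray, R: np.ndarray, gamma: float = 0.9, n_iter: int = 2000) -> np.ndarray:
    """Value iteration for the optimal Q-function of a state-based reward R."""
    Q = np.zeros(T.shape[:2])
    for _ in range(n_iter):
        Q = R[:, None] + gamma * T @ Q.max(axis=1)
    return Q

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)      # normalize to valid transition distributions
R = rng.random(S)
x, y = 2.5, -1.3

# Linearity of Q* in the reward: Q*(s, a | xR + y) = x Q*(s, a | R) + y / (1 - gamma).
lhs = q_star(T, x * R + y, gamma)
rhs = x * q_star(T, R, gamma) + y / (1 - gamma)
assert np.allclose(lhs, rhs, atol=1e-6)
```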

Using the proposed likelihood model offers several advantages. First of all, it enables a more generic choice of the uncertainty coefficient β (Section 3.1.1). This is because the returned $Q^{\bullet}$-values lie in the fixed range [0, 1], where 0 always indicates the lowest and 1 indicates the highest confidence. For example, setting $\beta = \log(\beta')$ for some $\beta' \in (0, \infty)$ always corresponds to the assumption that the expert chooses the optimal action with a probability that is $\beta'$ times higher than the probability of choosing the least favorable action, irrespective of the underlying system model.
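For concreteness, this interpretation follows directly from Equation (8) together with the fact that $Q^{\bullet}$ attains the values 1 and 0 at the best and worst actions whenever $Q^{\wedge}(s \mid g) \neq Q^{\vee}(s \mid g)$:

```latex
\frac{\pi^{\bullet}(a_{\mathrm{opt}} \mid s, g)}{\pi^{\bullet}(a_{\mathrm{worst}} \mid s, g)}
  = \frac{\exp\{\beta \cdot 1\}}{\exp\{\beta \cdot 0\}}
  = e^{\beta}
  = \beta' \qquad \text{for } \beta = \log(\beta').
```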

Moreover, as the results in Figure 6 reveal, the induced subgoal posterior distribution is notably closer to our expectation. The reason for this is twofold: first, a likelihood computation based on relative advantages mitigates the influence of the transition dynamics discussed in Section 3.1.2. This is because the described cumulation effect of the state visitation distribution ρ_{π_g} (Equation 6) is present in the returns of all actions and is thus reduced through the proposed normalization. For instance, if the agent in our grid world follows a policy that is directed entirely upward (as shown in the example), the induced state visitation distribution exhibits increased values at the upper border states of the world, even if we manipulated the first action of the agent (as considered in the Bellman Q-function). Accordingly, the original model would indicate an increased subgoal likelihood at those states. The normalized model, by contrast, is less affected as it constructs the likelihood by considering the increased visitation frequencies relative to each other.

Second, since the normalization diminishes the effect of the discounting, the subgoal posterior distribution is less concentrated around the trajectory ending and shows significant mass along the extrapolated path of the agent. This property allows us to identify distant states as potential goal locations, which adds more flexibility to the inferred subgoal constellation (compare Figure 4). As an illustrative example, consider the scenario shown in the bottom part of Figure 6. We observe that the normalized model assigns high posterior mass to all states in the right three corridors since any subgoal located in those corridors explains the demonstration set equally well. Here, the difference between the two models is even more pronounced because the transition dynamics have a strong impact on the agent behavior due to the added wall states. For further details, we refer to Section 5.1, where we provide additional insights into the subgoal inference mechanism.

3.2 Modeling Time-Invariant Intentions

With our redesigned likelihood model, we now focus on the partitioning structure of the model. Herein, we first consider the case where the intentions of the agent are constant with respect to time. As explained in Limitation 3, this setting is consistent with the standard MDP formalism in the sense that the optimal policy for the considered task can be described in the form of a state-to-action mapping.

As a first step, to account for this relation, we establish a link between the model partitioning structure and the underlying system state space by replacing the demonstration-based indicators z̃ = {z̃_d ∈ ℕ}_{d=1}^{D} with a new set of variables z := {z_i ∈ ℕ}_{i=1}^{|S|}. Unlike z̃, these new indicators do not operate directly on the data but are instead tied to the elements in S. Although they formally represent a new type of variable, we can still imagine that their distribution follows a CRP. This yields an intermediate model of the form

p(a, z, G | s) = p(z) ∏_{k=1}^{∞} p_g(g_k | s) ∏_{d=1}^{D} π•(a_d | s_d, g_{z_{s_d}}),

whose structure is illustrated in Figure 2b. To see the difference to Equation (4), notice the way the subgoals are indexed in this model.

The intermediate model makes it possible to reason about the policy (or, more suggestively, the underlying state-to-action rule approximated by the expert) at visited parts of the state space. Yet, the model is unable to extrapolate the gathered information to unvisited states, for the reasons explained in Section 2.2. This problem can be solved by replacing the exchangeable prior distribution over subgoal assignments induced by the CRP with a non-exchangeable one, in order to account explicitly for the covariate state information contained in the demonstration set. Based on our insights from Bayesian policy recognition (Sosic et al., 2018b), we use the distance-dependent Chinese restaurant process (ddCRP, Blei and Frazier, 2011) for this purpose, which allows a very intuitive handling of the state context, as explained below. For alternatives, we point to the survey paper by Foti and Williamson (2015).

In contrast to the CRP, which assigns states to partitions, the ddCRP assigns states to other states, based on their pairwise distances. These "to-state" assignments are described by a set of indicators c := {c_i ∈ S}_{i=1}^{|S|} with prior distribution p(c) = ∏_{i=1}^{|S|} p(c_i),

p(c_i = j) ∝  ν            if i = j,
              f(∆_{i,j})    otherwise,                                   (9)

for i, j ∈ S. Herein, ν ∈ [0, ∞) is called the self-link parameter of the process, ∆_{i,j} denotes the distance from state i to state j, and f : [0, ∞) → [0, ∞) is a monotone decreasing score function. Note that the distances {∆_{i,j}} can be obtained via a suitable metric defined on the state space, which may be furthermore used for calibrating the score function f (see subsequent section). The state partitioning structure itself is then determined by the connected components of the induced ddCRP graph (Figure 7). Our joint distribution, visualized in Figure 2c, thus reads as

p(a, c, G | s) = p(c) ∏_{k=1}^{∞} p_g(g_k | s) ∏_{d=1}^{D} π•(a_d | s_d, g_{z(c)|s_d}),   (10)

where z(c)|_s denotes the subgoal label of state s arising from the considered indicator set c. In order to highlight the state dependence of the underlying subgoal mechanism, we refer to this model as ddBNIRL-S.
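To illustrate how a partition arises from the "to-state" assignments, the following sketch (ours, not the authors' code) draws one indicator set from the prior in Equation (9) and recovers the state clusters as the connected components of the induced link graph, treating links as undirected as in Figure 7. The distance matrix Delta, the score function f, and the self-link parameter nu are placeholders for the problem-specific choices discussed in the next two subsections.

```python
import numpy as np

def sample_ddcrp_assignments(Delta, f, nu, rng):
    """Draw one set of to-state indicators c_i from the ddCRP prior of Equation (9)."""
    S = Delta.shape[0]
    c = np.empty(S, dtype=int)
    for i in range(S):
        weights = f(Delta[i])           # score for linking state i to every other state
        weights[i] = nu                 # self-link weight
        c[i] = rng.choice(S, p=weights / weights.sum())
    return c

def partition_from_assignments(c):
    """Cluster labels = connected components of the (undirected) ddCRP link graph."""
    S = len(c)
    parent = list(range(S))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(c):           # union the endpoints of every link
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(S)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])

# Toy example with a hypothetical 5-state chain and an exponential score function.
rng = np.random.default_rng(0)
Delta = np.abs(np.subtract.outer(np.arange(5), np.arange(5))).astype(float)
c = sample_ddcrp_assignments(Delta, f=lambda d: np.exp(-d), nu=1.0, rng=rng)
print(c, partition_from_assignments(c))
```

During inference, the same connected-component bookkeeping determines which clusters a newly sampled edge would merge.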


3.2.1 The Canonical State Metric for Spatial Subgoal Modeling

The use of the ddCRP as a prior model for the state partitioning in Equation (10) inevitably requires some notion of distance between any two states of the system, in order to compute the involved function scores {f(∆_{i,j})}. When no such distances are provided by the problem setting (see Limitation 2, second point), a suitable (quasi-)metric can be derived from the transition dynamics of the system, which turns out to be the canonical choice for the ddBNIRL-S model. Consider the Markov chain governing the state process {s_{t=n}}_{n=1}^{∞} of an agent for some specific policy π. For any ordered pair of states (i, j), the chain naturally induces a value T^π_{i→j}, called a hitting time (Taylor and Karlin, 1984; Tewari and Bartlett, 2008), which represents the expected number of steps required until the state process, initialized at i, eventually reaches state j for the first time,

T^π_{i→j} := E[ min{n ∈ ℕ : s_{t=n} = j} | s_0 = i, π ].

In the context of our subgoal problem, the natural quasi-metric to measure the directed distance between two states i and j is thus given by the time it takes to reach the goal state j from the starting state i under the corresponding optimal subgoal policy π_j(s) = arg max_{a∈A} Q*(s, a | j), i.e., ∆_{i,j} := T^{π_j}_{i→j}. For ddBNIRL-S (as well as for the waypoint method in BNIRL), this choice is particularly appealing since the subgoal policies {π_j} are already available within the inference procedure after the state-action values have been computed for the likelihood model (more on this in Section 4.5). The corresponding distances {∆_{i,j}} can be obtained efficiently in a single policy evaluation step since ∆_{i,j} corresponds to the optimal (negative) expected return at the starting state i for the special setting where the respective target state j is made absorbing with zero reward while all other states are assigned a reward of −1.
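As a concrete illustration of this construction, the sketch below (our own, under the stated assumptions) computes the distances ∆_{i,j} for a fixed target state j from a tabular transition tensor T[s, a, s'] by solving the auxiliary MDP described above; the use of undiscounted value iteration is an implementation choice, not something prescribed by the paper.

```python
import numpy as np

def hitting_time_distances(T, j, n_iter=1000):
    """Directed distances Delta[i] = T^{pi_j}_{i->j}: expected steps to reach target j
    under the optimal subgoal policy, obtained as the negative optimal value of the
    auxiliary MDP in which j is absorbing with zero reward and every step costs 1."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        # Bellman optimality backup with reward -1 per step (no discounting assumed).
        Q = -1.0 + T @ V          # shape (S, A): expectation over successor states
        V = Q.max(axis=1)
        V[j] = 0.0                # target state is absorbing with zero reward
    return -V                     # Delta[i] >= 0, Delta[j] = 0

# Hypothetical 3-state, 2-action example with deterministic "move right" dynamics.
T = np.zeros((3, 2, 3))
T[:, 0, :] = np.eye(3)                        # action 0: stay
T[0, 1, 1] = T[1, 1, 2] = T[2, 1, 2] = 1.0    # action 1: step toward state 2
print(hitting_time_distances(T, j=2))         # -> [2., 1., 0.]
```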

3.2.2 Choice of the Score Function

From Equation (9) it is evident that the ddCRP model favors partitioning structures that result from the connection of nearby states. In the context of the subgoal problem, this property translates to the prior assumption that, most likely, each subgoal is approached by the expert from only one specific localized region in the system state space. While this assumption may be reasonable for some tasks, other tasks require that certain target states be approached more than once, from different regions in the system state space. In such cases, it is beneficial if the model can reuse the same subgoal in various contexts, in order to obtain a more efficient task encoding (Figure 4).

From a mathematical point of view, the prerequisite for learning such encodings is that the score function f does not shrink to zero at large distance values, so that there remains a non-zero probability of connecting states that are far apart from each other. This can be achieved, for example, by representing f as a convex combination of a monotone decreasing zero-approaching function f̄ : [0, ∞) → [0, ∞) and some constant offset κ ∈ (0, 1],

f(∆) = (1 − κ) f̄(∆) + κ,

where f̄ is chosen, e.g., as a radial basis function (Sosic et al., 2018b). Note that, in order to implement a desired degree of locality in the model, the scale of the decay function f̄ (or f, respectively) can be further calibrated based on the quantiles of the distribution of the given distances {∆_{i,j}}.
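A possible realization of this score function is sketched below (our own illustration); the squared-exponential decay and the quantile-based length scale are example choices consistent with the description above, not the exact calibration used in the paper.

```python
import numpy as np

def make_score_function(distances, kappa=0.1, quantile=0.2):
    """Score f(d) = (1 - kappa) * f_bar(d) + kappa, with an RBF decay f_bar whose
    length scale is set to a quantile of the observed pairwise distances."""
    scale = np.quantile(distances, quantile)   # hypothetical calibration rule
    f_bar = lambda d: np.exp(-(d / scale) ** 2)
    return lambda d: (1.0 - kappa) * f_bar(d) + kappa

# Example: calibrate on the pairwise distances of a hypothetical problem.
Delta = np.random.default_rng(0).exponential(scale=5.0, size=(50, 50))
f = make_score_function(Delta.ravel(), kappa=0.05)
print(f(0.0), f(100.0))   # close to 1.0 near zero distance, close to kappa far away
```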


3.3 Modeling Time-Varying Intentions

For the case of changing expert intentions, we need to keep the flexibility of BNIRL to select a new subgoal at each decision instant, instead of restricting our policy to target a unique subgoal per state (Figure 5). Hence, we retain the basic BNIRL structure in this case and define the subgoal allocation mechanism using a set of data-related indicator variables. However, in contrast to BNIRL, which makes no assumptions about the temporal relationship of the subgoals and thus allows arbitrary changes of the expert's intentions (Section 2.1), we design our joint distribution in a way that favors smooth action plans in which the expert persistently follows a subgoal over an extended period of time. Again, we can make use of the ddCRP properties to encode the underlying smoothness assumption, but this time using a score function defined on the temporal distance between demonstration pairs. For this purpose, we require an additional piece of information, namely the unique timestamp of each demonstration example. Accordingly, we need to assume that our data set is of the form D := {(s_d, a_d, t_d)}_{d=1}^{D}, where t_d denotes the recording time of the dth demonstration pair (s_d, a_d).²

2. Note that the timestamps {t_d} are naturally available if the demonstrations are recorded in trajectory form, where we observe several consecutive state-action pairs. In fact, the temporal information of the data is also required for the waypoint method to work (Limitation 2), even though the authors of BNIRL formally assume to have access to the reduced data set of state-action pairs only.

The prior distribution over data partitionings can then be written as p(c̃) = ∏_{d=1}^{D} p(c̃_d),

p(c̃_d = d′) ∝  ν             if d = d′,
               f(∆̃_{d,d′})    otherwise,

where the indices d, d′ ∈ {1, . . . , D} range over the size of the demonstration set. Herein, ∆̃_{d,d′} := |t_d − t_{d′}| denotes the temporal distance between the data points d and d′. As before, we use the "∼"-notation to distinguish the data-related partitioning variables c̃, z̃ and distances {∆̃_{d,d′}} from their state-space-related counterparts c, z and {∆_{i,j}} used in ddBNIRL-S. Note, however, that the score function f is independent of the underlying model type and may be chosen as described in Section 3.2.2, with a scale calibrated to the duration of the demonstrated task.

With that, we obtain our temporal subgoal model as

p(a, c̃, G | s) = p(c̃) ∏_{k=1}^{∞} p_g(g_k | s) ∏_{d=1}^{D} π•(a_d | s_d, g_{z(c̃)|d}),   (11)

where z(c̃)|_d refers to the subgoal label of the dth demonstration pair induced by the given assignment c̃. Analogous to our spatial subgoal model, we refer to this model as ddBNIRL-T. The structural differences between all models can be seen from Figure 2.

3.3.1 Relationship to BNIRL

Since the distance-dependent CRP contains the classical CRP as a special case for a specific choice of distance metric and score function (Blei and Frazier, 2011), the ddBNIRL-T model can be considered a strict generalization of the original BNIRL framework (neglecting the likelihood normalization in Section 3.1). In the same way, ddBNIRL-S generalizes the intermediate model presented in Section 3.2 (Figure 2). However, although the BNIRL model can be recovered from ddBNIRL, it is important to note that the sampling mechanisms of both frameworks are fundamentally different. Whereas in BNIRL the subgoal assignments are sampled directly, the clustering structure in ddBNIRL is defined implicitly via the assignment variables c and c̃, respectively. As explained by Blei and Frazier (2011), this has the effect that the Markov chain governing the Gibbs sampler mixes significantly faster because several cluster assignments can be altered in a single step, which effectively realizes a blocked Gibbs sampler (Roberts and Sahu, 1997).

3.4 Static versus Dynamic Subgoal Allocation

With the model structures described in Sections 3.2 and 3.3, we have presented two alternative views on the subgoal problem. Naturally, the question arises which of the two approaches is better suited for a particular application scenario. As explained in the previous paragraphs, the main difference between the two models lies in their structure, i.e., in the way subgoals are allocated. While ddBNIRL-S relies on a static assignment mechanism that consistently links the individual states of a system to their corresponding subgoals, ddBNIRL-T allocates its subgoals per demonstration pair. The latter means that different state-action pairs observed at the same state can be explained using different intentional settings (Figure 5). To answer the above question, we hence need to ask under which conditions an observed decision-making process can be described via a static assignment rule that uniquely characterizes each system state, and in which situations we require a more flexible model that allows us to take additional side information into account.

From decision-making theory, we know that the optimal solutions for time-invariant MDPs can be formulated as deterministic time-invariant Markov policies (Puterman, 1994), the class of which is fully covered by the static ddBNIRL-S framework.³ Therefore, assuming that the transition dynamics of our system are constant with respect to time and that the agent acts rationally while having complete knowledge of the environment, there exist only two plausible reasons why we would potentially observe the agent execute a time-variant policy:

• either, the reward model of the agent changes over time,
• or, the observed decision-making process is not Markovian with respect to the assumed state space model (i.e., the agent's decisions depend on additional context information that is not explicitly captured in our state representation).

Accordingly, if we assume that the Markov property holds (meaning that the chosen state representation is sufficiently rich to capture the decision-making strategy of the agent), the only theoretical justification to prefer a dynamic subgoal model like ddBNIRL-T over a static one such as ddBNIRL-S would be if we assume that the intentions of the agent are truly time-dependent.

3. While we omit a rigorous proof here, this can be seen intuitively by noticing that any state-to-action rule that is optimal for a given MDP reward function can be synthesized via ddBNIRL-S by assuming an individual subgoal for each state in the extreme case.

Practically speaking, however, there can be several reasons why a given state representation might not fulfill the Markov requirement. One obvious explanation would be that the actual state space of the demonstrator is not perfectly known. This situation occurs, for example, if not all state context available to the agent is observable by the modeler. Another potential situation is when the strategy of the agent depends on information that is independent of the system dynamics and hence deliberately excluded from the state variable (i.e., parameters that are unaffected by the actions of the agent, such as the preselection of a specific high-level strategy). A generic framework for such settings is described by Daniel et al. (2016b), where the agent learns multiple sub-policies that are triggered depending on context information that is treated separately from the state.

To an external observer who is unaware of that context information, the resulting policy of the agent would potentially appear time-dependent, in which case the only chance to disentangle the individual sub-policies would be to resort to a dynamic subgoal encoding, such as provided by ddBNIRL-T. However, if the context is known (like the temporal information in Section 3.3 as a particular example), both approaches can be used equivalently and will only differ in the resulting state representation. More specifically, we can either fall back on the static ddBNIRL-S model by augmenting the state variable with the context information accordingly, or we can resort to the dynamic subgoal allocation scheme of ddBNIRL-T, using a distance metric that accounts for the context. Conversely, when considered in a purely time-invariant setting (where the context is described by some other known quantity), ddBNIRL-S and ddBNIRL-T can be regarded as two sides of the same coin, i.e., both can be used to describe the time-invariant policy of an observed demonstrator but they differ in the way the side information is represented.

4. Prediction and Inference

Having introduced the ddBNIRL framework, we now explain how it can be used to generalize a given expert behavior. To this end, we first focus on the task of action prediction at a given query state, and then explain in a second step how to extract the necessary information from the demonstration data. Along the way, we also give insights into the implicit intentional model learned through the framework.

Note: In order to keep the level of redundancy at a minimum, the following considerations are based on the ddBNIRL-S model. The results for ddBNIRL-T follow straightforwardly; the only change in the equations is the way the subgoals are referenced. To obtain the corresponding expressions, we simply replace the assignment variables c with c̃ and change the cluster definition in Equation (16) to C_k := {d ∈ {1, . . . , D} : z(c̃)|_d = k}. Accordingly, all occurrences of z(c)|_{s*} change to z(c̃)|_{d*}, z(c)|_{s_d} becomes z(c̃)|_d, and s_d ∈ C_k is replaced with d ∈ C_k.

4.1 Action Prediction

Similar to the work by Abbeel and Ng (2004), we consider the task of predicting an action a* ∈ A at some query state s* ∈ S that is optimal with respect to the expert's unknown reward model. However, in contrast to most existing IRL methods, our approach is not based on point estimates of the expert's reward function but takes into account the entire hypothesis space of reward models. This allows us to obtain the full posterior predictive policy from the expert data. Mathematically, the task is formulated as computing the predictive action distribution p(a* | s*, D), which captures the full information about the expert behavior contained in the demonstration set D. We start by expanding that distribution with the help of the latent state assignments c,

p(a* | s*, D) = ∑_{c ∈ S^{|S|}} p(a* | s*, D, c) p(c | D).

The conditional distribution p(a* | s*, D, c) can be expressed in terms of the posterior distribution of the subgoal targeted at the query state s*,

p(a* | s*, D) = ∑_{c ∈ S^{|S|}} p(c | D) ∑_{i ∈ S} p(a* | s*, c, g_{z(c)|s*} = i) p(g_{z(c)|s*} = i | D, c),

where we used the fact that the prediction a* is conditionally independent of the demonstration set D given the state partitioning structure and the corresponding subgoal assigned to s* (that is, given c and g_{z(c)|s*}). From the joint distribution in Equation (10), it follows that

p(g_k | D, c) = (1 / Z_k(D, c)) p_g(g_k | s) ∏_{d : z(c)|s_d = k} π(a_d | s_d, g_k),   (12)

where Z_k(D, c) is the corresponding normalizing constant,

Z_k(D, c) := ∑_{i ∈ supp(p_g)} p_g(g_k = i | s) ∏_{d : z(c)|s_d = k} π(a_d | s_d, g_k = i).   (13)

Using this relationship, we get

p(a* | s*, D) = ∑_{c ∈ S^{|S|}} (p(c | D) / Z_{z(c)|s*}(D, c)) ∑_{i ∈ supp(p_g)} p_g(g_{z(c)|s*} = i | s)
                × ∏_{d : z(c)|s_d = z(c)|s*} π(a_d | s_d, g_{z(c)|s*} = i) p(a* | s*, c, g_{z(c)|s*} = i).

In contrast to the summation over subgoal locations i, whose computational complexity is determined by the support of the subgoal prior distribution p_g and which grows at most linearly with the size of S, the marginalization with respect to the indicator variables c involves the summation of |S|^{|S|} terms and becomes quickly intractable even for small state spaces. Therefore, we approximate this operation via Monte Carlo integration, which yields

p(a* | s*, D) ≈ (1/N) ∑_{n=1}^{N} ∑_{i ∈ supp(p_g)} p(g_{z(c^{(n)})|s*} = i | D, c^{(n)}) p(a* | s*, c^{(n)}, g_{z(c^{(n)})|s*} = i),

where c^{(n)} ∼ p(c | D). The final prediction step can then be performed, for example, via the maximum a posteriori (MAP) policy estimate,

π(s*) := arg max_{a* ∈ A} p(a* | s*, D).   (14)

The inference task, hence, reduces to the computation of the posterior samples {c^{(n)}}, which is described in the next section.
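For illustration, the following sketch (our own, with hypothetical input arrays) evaluates the Monte Carlo approximation above. It assumes that the partition samples have already been reduced to cluster labels, that pi_lik[i, d] stores π(a_d | s_d, g = i) for every candidate subgoal location i, and that pi_query[i, a] stores π(a | s*, g = i); none of these names appear in the paper.

```python
import numpy as np

def predictive_action_distribution(cluster_of_demo, cluster_of_query, pg, pi_lik, pi_query):
    """Monte Carlo estimate of p(a* | s*, D) from N posterior partition samples.

    cluster_of_demo  : (N, D) cluster labels z(c^(n))|s_d of the demonstration states
    cluster_of_query : (N,)   cluster labels z(c^(n))|s* of the query state
    pg               : (N_g,) subgoal prior over the candidate locations
    pi_lik           : (N_g, D) demonstration likelihoods pi(a_d | s_d, g = i)
    pi_query         : (N_g, |A|) action likelihoods pi(a | s*, g = i) at the query state
    """
    N = cluster_of_demo.shape[0]
    p_action = np.zeros(pi_query.shape[1])
    for n in range(N):
        # Demonstrations that share the cluster of the query state under sample n.
        mask = cluster_of_demo[n] == cluster_of_query[n]
        # Unnormalized subgoal posterior of that cluster, Equation (12), in log space.
        log_w = np.log(pg) + np.log(pi_lik[:, mask]).sum(axis=1)
        w = np.exp(log_w - log_w.max())
        posterior = w / w.sum()
        # Average the subgoal-conditional action likelihoods under this posterior.
        p_action += posterior @ pi_query
    return p_action / N

# The MAP policy estimate of Equation (14) is then the arg max of the returned vector.
```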


4.2 Partition Inference

Based on the joint model in Equation (10), we obtain the posterior distribution p(c | D) in factorized form as

p(c | D) ∝ p(c) ∏_{k=1}^{∞} ∑_{g_k ∈ supp(p_g)} p_g(g_k | s) ∏_{d=1}^{D} π(a_d | s_d, g_{z(c)|s_d})

         = p(c) ∏_{k=1}^{|z(c)|} ∑_{g_k ∈ supp(p_g)} p_g(g_k | s) ∏_{d : s_d ∈ C_k} π(a_d | s_d, g_k),   (15)

where C_k denotes the kth state cluster induced by the assignment c,

C_k := {s ∈ S : z(c)|_s = k},   (16)

and |z(c)| is the total number of clusters defined by c. As explained by Blei and Frazier (2011), the indicator samples {c^{(n)}} can be efficiently generated using a fast-mixing Gibbs chain. Starting from a given ddCRP graph defined by the subset of indicators c_{\i} := {c_j} \ c_i, the insertion of an additional edge c_i will result in one of three possible outcomes, as illustrated in Figure 7: in the case of adding a self-loop (c_i = i), the underlying partitioning structure stays unaffected. Setting c_i ≠ i either leaves the structure unchanged (if the target state is already in the same cluster as state i) or creates a new link between two clusters. In the latter case, the involved clusters are merged, which corresponds to a merging of the associated sums in Equation (15). According to these three cases, the conditional distribution for the Gibbs procedure is obtained as

p(c_i = j | c_{\i}, D) ∝
    ν                                                              if i = j,
    f(∆_{i,j})                                                     if no clusters are merged,
    f(∆_{i,j}) · L(C_{z_i} ∪ C_{z_j}) / (L(C_{z_i}) · L(C_{z_j}))   if clusters C_{z_i} and C_{z_j} are merged.   (17)

Herein, L(C) denotes the marginal action likelihood of all demonstrations accumulated in cluster C,

L(C) = ∑_{g ∈ supp(p_g)} p_g(g | s) ∏_{d : s_d ∈ C} π(a_d | s_d, g),   (18)

which further represents the normalizing constant for the posterior distribution of the cluster subgoal (Equation 13). Accordingly, the fraction in Equation (17) can be interpreted as the likelihood ratio of the partitioning defined by c_{\i} and the merged structure after inserting the new edge c_i.
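To make the sampling step explicit, the sketch below (ours, with hypothetical helper inputs) evaluates the unnormalized conditional of Equation (17) for a single indicator c_i. It assumes that the partition induced by c with the edge c_i removed has already been computed, that demo_members maps a cluster label to the list of demonstration indices it contains, and that pg and pi_lik are the same subgoal prior and likelihood tables as in the prediction sketch above.

```python
import numpy as np

def cluster_likelihood(members, pg, pi_lik):
    """Log of the marginal action likelihood L(C) of Equation (18) for the
    demonstration indices in `members`."""
    log_terms = np.log(pg) + np.log(pi_lik[:, members]).sum(axis=1)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())

def gibbs_weights_for_ci(i, labels_without_i, demo_members, Delta, f, nu, pg, pi_lik):
    """Unnormalized conditional p(c_i = j | c_{\\i}, D) of Equation (17).

    labels_without_i : cluster label of every state once the old edge c_i is removed
    demo_members     : function mapping a cluster label -> list of demonstration indices
    """
    S = len(labels_without_i)
    logw = np.empty(S)
    for j in range(S):
        if j == i:
            logw[j] = np.log(nu)                          # self-link case
        elif labels_without_i[j] == labels_without_i[i]:
            logw[j] = np.log(f(Delta[i, j]))              # no clusters are merged
        else:
            Ci = demo_members(labels_without_i[i])
            Cj = demo_members(labels_without_i[j])
            ratio = (cluster_likelihood(Ci + Cj, pg, pi_lik)
                     - cluster_likelihood(Ci, pg, pi_lik)
                     - cluster_likelihood(Cj, pg, pi_lik))
            logw[j] = np.log(f(Delta[i, j])) + ratio      # merge case: likelihood ratio
    return np.exp(logw - logw.max())                      # ready for normalized sampling
```

Sampling c_i then amounts to drawing an index proportionally to the returned weights and updating the connected components of the link graph.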

4.3 Subgoal Inference

It is important to note that the inference method described in Sections 4.1 and 4.2 is based on a collapsed sampling scheme where all subgoals of our model are marginalized out. In fact, the ddBNIRL framework differs from BNIRL and other IRL methods in that the reward model of the expert is never made explicit for predicting new actions. Nonetheless, if desired (e.g., for the purpose of analyzing the expert's intentions), an estimate of the subgoal locations can be obtained in a post-hoc fashion from the subgoal posterior distribution in Equation (12) for any given assignment c. Examples are provided in Figure 8.


Figure 7: Insertion of an edge (dashed arrow) to the ddCRP graph. Colors indicate the cluster memberships of the nodes, which are defined implicitly via the connected components of the graph. (a) Adding a self-loop or (b) inserting an edge between two already connected nodes does not alter the clustering structure. (c) Adding an edge between two unconnected components merges the associated clusters.

4.4 Action Inference

As mentioned in Section 2.2, the original BNIRL algorithm requires complete knowledge of the expert's action record a, which limits the range of potential application scenarios. For this reason, we generalize our inference scheme to the case where we have access to state information only, provided in the form of an alternative data set D := {(s_d, s′_d)}_{d=1}^{D}, where s′_d refers to the state visited by the expert immediately after s_d. In this setting, inference can be performed by extending the Gibbs procedure with an additional collapsed sampling stage,

p(a_d | a_{\d}, D, c) ∝ T(s′_d | s_d, a_d) ∑_{i ∈ supp(p_g)} p_g(g_{z(c)|s_d} = i) ∏_{d′ : z(c)|s_{d′} = z(c)|s_d} π(a_{d′} | s_{d′}, g_{z(c)|s_d} = i),   (19)

which, for a fixed assignment c, recovers an estimate of the latent action set a from the observed state transitions. Note that knowledge of the transition model T is required for this step as it provides the necessary link between the expert's actions and the observed successor states. The same extension is possible for the ddBNIRL-T model, provided that the transition timestamps {t_d} are known (Section 3.3).
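The following sketch (ours, under the stated assumptions) implements one such collapsed draw for a single latent action according to Equation (19); pi_table, members_of_d, and the tabular transition tensor T are hypothetical inputs, and a log-space variant would be preferable for clusters with many members.

```python
import numpy as np

def sample_action(d, actions, states, next_states, members_of_d, T, pi_table, pg, rng):
    """One collapsed Gibbs draw of the latent action a_d according to Equation (19).

    actions      : current action estimates of all demonstrations
    members_of_d : indices d' != d that share the cluster of demonstration d
    pi_table     : pi_table[i, s, a] = subgoal likelihood pi(a | s, g = i)
    """
    s_d, s_next = states[d], next_states[d]
    # Likelihood contribution of the other cluster members, per candidate subgoal i.
    rest = np.prod(pi_table[:, states[members_of_d], actions[members_of_d]], axis=1)
    n_actions = pi_table.shape[2]
    # Append the term of a_d itself for every candidate action and weight by the
    # transition probability toward the observed successor state.
    weights = np.array([T[s_d, a, s_next] * np.sum(pg * rest * pi_table[:, s_d, a])
                        for a in range(n_actions)])
    return rng.choice(n_actions, p=weights / weights.sum())
```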

4.5 Computational Complexity

As a last point in this section, we would like to discuss the computational complexity of our approach. For this purpose, here is a quick reminder on the used notation: we write |S| and |A| for the cardinalities of the state and action space, respectively, and use the letter D for the size of the demonstration set. Further, we write C_k to refer to the kth state cluster (ddBNIRL-S) or data cluster (ddBNIRL-T). In the subsequent paragraphs, we additionally use the notation N_D(C_k) to access the number of demonstration data points associated with cluster C_k, K to indicate the number of clusters in the current iteration, N_g := |supp(p_g)| as a shorthand for the size of the support of the subgoal prior distribution, and N_c for the number of indicator variables, i.e., N_c := |S| for ddBNIRL-S and N_c := D for ddBNIRL-T.

Initialization Phase: Common to all discussed models (including BNIRL) is that they depend on a preceding planning phase, where we compute, potentially in parallel, the state-action value functions (Equation 3) for all N_g considered subgoals, which allows us to construct the subgoal likelihood model (Equation 2 or 8). The overall computational complexity of this procedure is of order O(N_g C_MDP(|S|, |A|)), where C_MDP(x, y) denotes the complexity of the used planning routine to (approximately) solve an MDP of size x with a total number of y actions. Using a value iteration algorithm, for instance, this can be achieved in O(C_MDP(|S|, |A|)) = O(|S|² |A|) steps (Littman et al., 1995). If we assume that the expert reaches all subgoals during the demonstration phase (Michini and How, 2012), we can restrict the support of the subgoal prior to the visited states, so that N_g is upper-bounded by min(|S|, D). Note that there exist approximation techniques that make the computation tractable in large/continuous state spaces (see discussion in Section 6).

Before we start the sampling procedure, we compute all single-cluster likelihoods {L(C_k)} and pairwise likelihoods {L(C_k ∪ C_{k′})} according to Equation (18), based on some (random) initial cluster structure. The likelihood computation for the kth cluster C_k involves a product over N_D(C_k) data points, which needs to be calculated for each of the N_g subgoals before taking their weighted average. This step has to be executed (potentially in parallel) for all clusters. However, because each demonstration is associated with exactly one cluster (either directly as in ddBNIRL-T or via the corresponding state variable as in ddBNIRL-S) and hence ∑_k N_D(C_k) = D, the total complexity for computing all single-cluster likelihoods is of order O(N_g D), irrespective of the actual cluster structure. A similar line of reasoning applies to the computation of the pairwise likelihoods, yielding the same complexity order. Yet, for the latter we need to consider all possible cluster combinations. Assuming an initial number of K clusters, there are in total K(K − 1)/2 pairwise likelihoods to be computed. Hence, the overall complexity of the initialization phase can be summarized as O(N_g D K²).

Partition Inference: For the partition inference, the bulk of the computation lies in the repeated construction of the likelihood term in Equation (17), which needs to be updated whenever the cluster structure changes. To analyze the complexity, we consider the sampling step of an individual assignment variable c_i (or likewise c̃_i). In the worst case, removing the edge that belongs to c_i from the ddCRP graph divides the associated cluster into two parts (Figure 7), so that two new single-cluster likelihoods need to be computed. With the upper bound D on the number of data points associated with the cluster before the division, this operation is of worst-case complexity O(N_g D) (see initialization phase). Irrespective of whether a division occurs, we then need to compute all pairwise cluster likelihoods with the (new) cluster connected via c_i. For a total of K − 1 possible choices, this is done in O(N_g D K) operations (see initialization phase). After assigning the indicator, we move on to the next variable where the process repeats. If we assume, for simplicity, that the number of clusters stays constant during a full Gibbs cycle, the total complexity of updating all cluster assignments is hence of order O(N_g D K N_c). A (pessimistic) upper bound for the general case can be obtained by assuming that each data point defines its own cluster, in which case the complexity increases to O(N_g D² N_c). Note that, in order to identify the new cluster structure after changing an assignment, we additionally need to track the connected components of the underlying ddCRP graph. As explained by Kapron et al. (2013), this can be done in polylogarithmic worst-case time.

Action Sampling: In order to compute the conditional probability distribution of a particular action a_d, we need to evaluate a product involving all actions that belong to the same cluster as action a_d (Equation 19). First, we can compute the product over all actions except a_d itself, where the number of involved terms is again upper-bounded by D. Appending the term that belongs to a_d for all possible action choices requires another |A| operations. These two steps need to be repeated for all possible subgoals, yielding an upper bound on the complexity of order O(N_g (D + |A|)). For a full Gibbs cycle, which involves sampling all D action variables, the overall (worst-case) complexity is hence of order O(N_g (D + |A|) D).

5. Experimental Results

In this section, we present experimental results for our framework. The evaluation is separated into four parts:

(i) a proof of concept and conceptual comparison to BNIRL (Section 5.1),
(ii) a performance comparison with related algorithms (Section 5.2),
(iii) a real data experiment conducted on a KUKA robot (Section 5.3), and
(iv) an active learning task (Section 5.4).

5.1 Proof of Concept

To illustrate the conceptual differences to BNIRL and provide additional insights into the latent intentional model learned through our framework, we begin with the motivating data set from Figure 1a, which had been originally presented by Michini and How (2012). The considered system environment, defined by |S| = 20 × 20 = 400 grid positions, is again shown in the top left corner of Figure 8. Nine of those positions correspond to inaccessible wall states, marked by the horizontal black bar. At the valid states, the expert can choose from an action set comprising a total of eight actions, each initiating a noisy state transition toward one of the (inter-)cardinal directions. The observed state-action pairs are depicted in the form of arrows, whose colors indicate the MAP partitioning learned through BNIRL. The remaining subfigures show the results of the ddBNIRL framework, which were obtained from a posterior sample returned by the respective algorithm (ddBNIRL-S/T) at a low temperature in a simulated annealing schedule (Kirkpatrick et al., 1983).

Comparing the obtained results, we observe the following main differences to the original approach:

(i) Unlike BNIRL, the proposed framework allows choosing between a spatial and a temporal encoding of the observed task, providing the possibility to account explicitly for the type of demonstrated behavior (static/dynamic). As explained in Section 3.3.1, the context-unaware (yet in principle dynamic) vanilla BNIRL inference scheme is still included as a special case.


[Figure 8 panels: sample partitionings, subgoal posteriors, and predictive policies for BNIRL, ddBNIRL-S, and ddBNIRL-T; bottom row: spatial policy (all subgoals combined) and temporal policy phases 1-3 (green, yellow, and red subgoals).]

Figure 8: Results on the BNIRL data set (Michini and How, 2012). Top row: demonstration data and sample partitionings generated by the different inference algorithms. Center row: subgoal posterior distributions associated with the partitions found by ddBNIRL-S and ddBNIRL-T. For a clearer overview, the corresponding BNIRL distributions are omitted (see Figure 6 for a comparison). Bottom row: time-invariant ddBNIRL-S policy model synthesized from all three detected subgoals (left) and temporal phases identified by ddBNIRL-T (right). The background colors have no particular meaning and were added only to highlight the structures of the policies. Because of its missing generalization mechanism, BNIRL does not itself provide a reasonable predictive policy model (Limitation 1).


(ii) Exploiting the spatial/temporal context of the data, the ddBNIRL solution is inherently robust to demonstration noise, giving rise to notably smoother partitioning structures (top row). This effect is particularly pronounced in the case of real data, as we shall see later in Section 5.3.2.

(iii) For each state partition or trajectory segment, we obtain an implicit representation of the associated subgoal in the form of a posterior distribution, without the need of assigning point estimates (center row). It is striking that the posterior distribution corresponding to the green state partition has a comparably large spread on the upper side of the wall. This can be explained intuitively by the fact that any subgoal located in this high posterior region could have potentially caused the green state sequence, which circumvents the wall from the right. At the same time, the green area of high posterior values exhibits a sharp boundary on the left side since a subgoal located in the upper left region of the state space would have more likely resulted in a trajectory approaching from the left.

(iv) In contrast to BNIRL, which has no built-in generalization mechanism (Limitation 1), our method returns a predictive policy model comprising the full posterior action information at all states. Note that we only show the resulting MAP policy estimates here (bottom row), computed according to Equation (14). Additional results concerning the posterior uncertainty are provided in Sections 5.3 and 5.4.

The example illustrates how the synthesis of the predictive policy differs between ddBNIRL-S (bottom left) and ddBNIRL-T (bottom row, rightmost three subfigures). While ddBNIRL-T uses a set of (conditionally) independent policy models to describe the different identified behavioral phases, ddBNIRL-S maps the entire subgoal schedule onto a single time-invariant policy representation. Looking closer at the learned models, we recognize that the ddBNIRL-S solution in fact realizes a spatial combination of the three temporal ddBNIRL-T components, where each component is activated in the corresponding cluster region of the state space. This gives us two alternative interpretations of the same behavior.

5.2 Random MDP Scenario

Our next experiment is designed to provide insights into the generalization abilities of the framework. For this purpose, we consider a class of randomly generated MDPs similar to the Garnet problems (Bhatnagar et al., 2009). The transition dynamics {T(· | s, a)} are sampled independently from a symmetric Dirichlet distribution with a concentration parameter of 0.01, where we choose |S| = 100 and |A| = 10. For each repetition of the experiment, N_R states are selected uniformly at random and assigned rewards that are, in turn, sampled uniformly from the interval [0, 1]. All other states contain zero reward. Next, we compute an optimal deterministic MDP policy π* with respect to a discount factor of γ = 0.9 and generate a number of expert trajectories of length 10. Herein, we let the expert select the optimal action with probability 0.9 and a random, suboptimal action with probability 0.1. The obtained state sequences are passed to the algorithms and we compute the normalized value loss of the reconstructed policies according to

L(π*, π) := ‖V* − V^π‖₂ / ‖V*‖₂,   (20)

where V* and V^π represent, respectively, the vectorized value functions of the optimal policy π* and the reconstruction π.
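For reference, the sketch below (our own code, not the authors' implementation) reproduces the main ingredients of this setup: Dirichlet-sampled transition dynamics, a sparse random reward vector, and the value loss of Equation (20) based on exact policy evaluation. The state-dependent reward convention and all helper names are assumptions.

```python
import numpy as np

def random_mdp(n_states=100, n_actions=10, n_reward=10, alpha=0.01, rng=None):
    """Garnet-like MDP: Dirichlet transition rows and a sparse random reward vector."""
    rng = rng or np.random.default_rng()
    T = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    R = np.zeros(n_states)
    R[rng.choice(n_states, size=n_reward, replace=False)] = rng.uniform(size=n_reward)
    return T, R

def policy_value(T, R, policy, gamma=0.9):
    """Exact policy evaluation: V = (I - gamma * T_pi)^{-1} R (state-based reward)."""
    n = len(R)
    T_pi = T[np.arange(n), policy]                  # (n, n) transition matrix under pi
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def value_loss(T, R, pi_star, pi_hat, gamma=0.9):
    """Normalized value loss of Equation (20)."""
    v_star = policy_value(T, R, pi_star, gamma)
    v_hat = policy_value(T, R, pi_hat, gamma)
    return np.linalg.norm(v_star - v_hat) / np.linalg.norm(v_star)
```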


[Figure 9 panels: value loss over the number of demonstrations (logarithmic axis) for (a) N_R = 1 (= |S|/100), (b) N_R = 10 (= |S|/10), (c) N_R = 50 (= |S|/2), and (d) N_R = 100 (= |S|).]

Figure 9: Comparison of all inference methods in the random MDP scenario for different reward densities. Shown are the empirical mean values and standard deviations of the resulting value losses, obtained from 100 Monte Carlo runs. The graphs show a clear difference between BNIRL, BNIRL-EXT and ddBNIRL-S, which illustrates the importance of considering the spatial context for subgoal extraction.

Since the considered system belongs to the class of time-invariant MDPs, ddBNIRL-S lends itself as the natural choice to model the expert behavior. As baseline methods, we adopt our subintentional Bayesian policy recognition framework (BPR, Sosic et al., 2018b), as well as maximum-margin IRL (Abbeel and Ng, 2004), maximum-entropy IRL (Ziebart et al., 2008), and vanilla BNIRL. Due to the missing generalization abilities of BNIRL (Limitation 1) and because the waypoint method (Section 2.2) does not straightforwardly apply to the considered scenario of multiple unaligned trajectories, we further compare our algorithm to an extension of BNIRL, which we refer to as BNIRL-EXT. Mimicking the ddBNIRL-S principle, the method accounts for the spatial context of the demonstrations by assigning each state to the BNIRL subgoal that is targeted by the closest (see metric in Section 3.2.1) state-action pair; however, these assignments are made after the actual subgoal inference. When compared to ddBNIRL-S, this provides a reference of how much can be gained by considering the spatial relationship of the data during the inference. For the experiment, both ddBNIRL-S and BNIRL(-EXT) are augmented with their corresponding action sampling stages (Section 4.4) since the action sequences of the expert are discarded from the data set, in order to enable a fair comparison to the remaining algorithms.

Figure 9 shows the value loss over the size of the demonstration set for different reward settings. For small N_R, both BNIRL(-EXT) and ddBNIRL-S significantly outperform the reference methods. This is because the sparse reward structure allows for an efficient subgoal-based encoding of the expert behavior, which enables the algorithms to reconstruct the policy even from minimal amounts of demonstration data. However, the BNIRL(-EXT) solutions drastically deteriorate for denser reward structures. In particular, we observe a clear difference in performance between the cases where

(i) we do not account for the spatial information in the partitioning model (BNIRL),

(ii) include it in a post-processing step (BNIRL-EXT), and

(iii) exploit it during the inference itself (ddBNIRL-S),

which demonstrates the importance of processing the context information. Most tellingly, ddBNIRL-S outperforms the baseline methods even in the dense reward regimes, although the subgoal-based encoding loses its efficiency here. In fact, the results reveal that the proposed approach combines the merits of both model types, i.e., the sample efficiency of the intentional models (max-margin/max-entropy) required for small data set sizes, as well as the asymptotic accuracy and fully probabilistic nature of the subintentional Bayesian framework (BPR).⁴

4. The comparably large loss of BPR for small data set sizes can be explained by the fact that the framework is based on a more general policy model in which the expert behavior is assumed to be inherently stochastic, in contrast to the setting considered here, where stochasticity arises merely as a consequence of suboptimal decision-making.

5.3 Robot Experiment

In the next experiment, we test the ddBNIRL framework on various real data sets, which we recorded on a KUKA lightweight robotic arm (Figure 10) via kinesthetic teaching. Videos of all demonstrated tasks can be found at http://www.spg.tu-darmstadt.de/jmlr2018.

The system has seven degrees of freedom, corresponding to the seven joints of the arm. Each joint is equipped with a torque sensor and an angle encoder, providing recordings of joint angles, velocities and accelerations. For our experiments, we only consider the xy-Cartesian coordinates spanning the transverse plane, which we computed from the raw measurements using a forward kinematic model. The data was recorded at a sampling rate of 50 Hz and further downsampled by a factor of 10, yielding an effective sample rate of 5 Hz, which provided a sufficient temporal resolution for the considered scenario.

Figure 10: KUKA lightweight robotic arm.

The goal of the experiment is to learn a set of high-level intentional models for the recorded behavior types by partitioning the data sets into meaningful parts that can be used to predict the desired motion direction of the expert. For simplicity and to demonstrate the algorithm's robustness to modeling errors, we adopt the simplistic transition model from Section 5.1 with the same action set containing the eight (inter-)cardinal motion directions. The high measurement accuracy of the end-effector position allows us to extract these high-level actions directly from the raw data, i.e., by selecting the directions with the smallest angular deviations from the ground truth (see example in Figure 11a). The underlying state space is obtained by discretizing the part of the coordinate range that is covered by the measurements into blocks of predefined size (see next sections for details). Apart from this discretization step and the aforementioned data downsampling, no preprocessing is applied.

5.3.1 Spatial Partitioning

First, we consider a case where the expert behavior can be described using a time-invariant policy model, which we aspire to capture via ddBNIRL-S. For our example, we consider the "Cycle" task shown in the video and in Figure 14. The same setting is analyzed using the time-variant ddBNIRL-T model in Section 5.3.2, which allows a direct comparison of the two approaches. The task consists in approaching a number of target positions, indicated by a set of markers (see video), before eventually returning to the initial state. The setting can be regarded as a real-world version of the "Loop" problem described by Michini and How (2012). As explained in their paper, classical IRL algorithms that rely on a global state-based reward model (such as max-margin IRL and max-entropy IRL) completely fail on this problem, due to the periodic nature of the task.

Figure 11: Results of ddBNIRL-S on the "Cycle" task. (a) Raw measurements (white lines) and discretized demonstration data (black arrows). The coloring of the background indicates a partitioning structure obtained from a low-temperature posterior sample. (b) Maximum a posteriori policy estimate. (c) Visualization of the model's prediction uncertainty at all system states, represented by the entropies of the corresponding posterior predictive action distributions. Dark background indicates high uncertainty. (d) Illustration of the final predictive model, comprising both the action information and the prediction uncertainty.

Figure 11a shows the downsampled and discretized data set (black arrows) obtained from four expert trajectories (white lines). For visualization purposes, the discretization block size is chosen as 2 cm × 2 cm, giving rise to a total of 18 × 24 = 432 states. As in the top row of Figure 8 (ddBNIRL-S), the coloring of the background indicates the learned partitioning structure, computed from a low-temperature posterior sample. We observe that the found state clusters clearly reveal the modular structure of the task, providing an intuitive and interpretable explanation of the data. However, although the induced policy model (Figure 11b) smoothly captures the cyclic nature of the task, we cannot expect to obtain trustworthy predictions in the center region of the state space, due to the lack of additional demonstration data that would be required to unveil the expert's true intention in that region. Clearly, a point estimate such as the shown MAP policy cannot reflect this prediction uncertainty since it does not carry any confidence information. Yet, following a Bayesian approach, we can naturally quantify the prediction uncertainty at any query state s* based on the shape of the corresponding posterior predictive action distribution p(a* | s*, D). A straightforward option is, for example, to consider the prediction entropy,


defined as

H(s*) := −∑_{a* ∈ A} p(a* | s*, D) log p(a* | s*, D).

In order to obtain an unbiased approximation of the true non-tempered predictive distribution p(a* | s*, D), we run a second Gibbs chain with unaltered temperature in parallel to the tempered chain. The resulting entropy estimates are summarized in an uncertainty map (Figure 11c), which we overlaid on the original prediction result to produce the final figure shown at the bottom right. Note that the obtained posterior uncertainty information of the model can be further used in an active learning setting, as demonstrated in Section 5.4.

5.3.2 Temporal Partitioning

Next, we turn our attention to the ddBNIRL-T model, which we test against the vanilla BNIRL approach. For this purpose, we consider the full collection of tasks shown in the supplementary video, which comprises different time-dependent expert behaviors of varying complexity. In order to obtain a quantitative performance measure for our evaluation, we conducted a manual segmentation of all recorded trajectories, thereby creating a set of ground truth subgoal labels for all observed decision times. The result of this segmentation step is depicted in the appendix (Figure 14, center column). Note that the ground truth subgoals are assumed immediately at the ends of the corresponding segments.

The left and right columns of Figure 14 show, respectively, the partitioning structures found by BNIRL and ddBNIRL-T, based on a uniform subgoal prior distribution with support at the visited states. The underlying state discretization block size is chosen as 1 cm × 1 cm, as indicated by the regular grid in the background. A simple visual comparison of the learned structures reveals the clear superiority of ddBNIRL-T over vanilla BNIRL on this problem set.

For our quantitative comparison, we consider the instantaneous subgoal localization errors of the two models over the entire course of a demonstration (Figure 12). Herein, the instantaneous localization error for a given state-action pair is measured in terms of the Euclidean distance between the grid location of the ground truth subgoal associated with the pair and the corresponding subgoal location predicted by the model. Note that the predictions of both models are based on the entire trajectory data of an experiment, considering the full posterior information after completing the demonstration. For ddBNIRL-T, which does not directly return a subgoal location estimate but instead provides access to the full subgoal posterior distribution, the error is computed with respect to the MAP subgoal locations {ĝ_k},

ĝ_k := arg max_{g_k ∈ supp(p_g)} p(g_k | D, c̃),

using the ddBNIRL-T version of Equation (12); see note at the beginning of Section 4.

Figure 12: Instantaneous subgoal localization errors of ddBNIRL-T (upper rows) and BNIRL (lower rows) for the eight recorded data sets. The black dots indicate the subgoal switching times in the corresponding ground truth subgoal annotation, depicted in the center column of Figure 14. On average, the localization error of ddBNIRL-T is significantly lower compared to the BNIRL approach, as indicated by the median values shown on the left. For a qualitative comparison of the underlying partitioning structures, see Appendix A.

The black dots in the figure indicate the time instants where the ground truth annotations change. At those time instants, we observe significantly increased localization errors for both models, which can be explained by the fact that the ground truth annotation is somewhat subjective around the switching points (see labeling in Figure 14). Also, we notice a comparably high error at the beginning and the end of some trajectories, which stems from the imperfect synchronization between the recording interval and the execution of the task (recall that we skipped the corresponding data preprocessing step). Hence, to capture the accuracy in a single figure of performance, we consider the median localization error of each time series, as it masks out these outliers and provides a more realistic error quantification than the sample mean. The obtained values are shown next to the error plots in Figure 12, indicating that the ddBNIRL-T localization error is in the range of the discretization interval in most cases. Compared to BNIRL, the proposed method yields an error reduction of more than 70% on average.

5.4 Active Learning

In Section 5.3, we saw that the posterior predictive action distribution p(a* | s*, D) provides a natural way to quantify the prediction uncertainty of our model at any given query state s*. This offers the opportunity to apply the framework in an active learning setting, since the induced uncertainty map (see example in Figure 11c) indicates in which parts of the state space the trained model can process further instructions from the expert most effectively.

Figure 13: Comparison between random data acquisition and active learning in the random MDP scenario. Shown are the empirical mean value losses of the obtained policy models over the number of data queries, obtained from 200 Monte Carlo runs.

To demonstrate the basic procedure, we reconsider the random MDP problem from Section 5.2 in an active learning context, where we compare different active strategies with the previously used random data acquisition scheme. As an initialization for the learning procedure, we request a single state-action pair (s_1, a_1) from the demonstrator, which we store in the initial data set D_1 := {(s_1, a_1)}. Herein, the state s_1 is drawn uniformly at random from S and the action a_1 ∼ π_E(a | s_1) is generated according to the noisy expert policy π_E : A × S → [0, 1] described in Section 5.2. Continuing from this point, each of the considered active learning algorithms requests a series of subsequent demonstrations ((s_2, a_2), (s_3, a_3), . . .), inducing a sequence of data sets (D_1, D_2, D_3, . . .), where the next query state s_{d+1} is chosen according to the specific data acquisition criterion f_acq of the algorithm evaluated on the current predictive model,

D_{d+1} = D_d ∪ {(s_{d+1}, a_{d+1})}
s_{d+1} = arg max_{s* ∈ S} f_acq[ p(a* | s*, D_d) ]
a_{d+1} ∼ π_E(a | s_{d+1}).

The purpose of the acquisition criterion is to assess the uncertainty of the model at all possible query states, so that the next demonstration can be requested in the high uncertainty region of the state space (see uncertainty sampling, Settles, 2010). For our experiment, we consider the following three common choices,

• highest entropy: f_acq(p) := −∑_{a ∈ A} p(a) log p(a),

• least confidence: f_acq(p) := 1 − max_{a ∈ A} p(a),

• smallest margin: f_acq(p) := p(a_2) − p(a_1),


where a_1 and a_2 denote, respectively, the most likely and second most likely action according to the considered distribution p, i.e., a_1 := arg max_{a ∈ A} p(a) and a_2 := arg max_{a ∈ A\{a_1}} p(a). At each iteration, we compute the value losses (Equation 20) of the induced policy models and compare them with the corresponding loss obtained from random data acquisition. The resulting curves are delineated in Figure 13. As expected, the learning speed of the model is significantly improved under all active acquisition schemes, which reduces the number of expert demonstrations required to successfully learn the observed task.
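As a small illustration, the sketch below (ours) implements the three acquisition criteria and the query-state selection rule; the array predictive, holding the current posterior predictive action distributions p(a | s, D_d) for all states, is a hypothetical input.

```python
import numpy as np

def highest_entropy(p):
    p = np.clip(p, 1e-12, None)                 # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def least_confidence(p):
    return 1.0 - p.max(axis=-1)

def smallest_margin(p):
    top2 = np.sort(p, axis=-1)[..., -2:]        # [second most likely, most likely]
    return top2[..., 0] - top2[..., 1]          # p(a_2) - p(a_1); arg max = smallest margin

def next_query_state(predictive, f_acq):
    """Pick the state whose predictive action distribution is most uncertain.

    predictive : (|S|, |A|) array of p(a | s, D_d) for every state s
    """
    return int(np.argmax(f_acq(predictive)))

# Example: choose the next demonstration state under the entropy criterion.
# predictive = ...   # posterior predictive action distributions of the current model
# s_next = next_query_state(predictive, highest_entropy)
```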

6. Conclusion

Building upon the principle of Bayesian nonparametric inverse reinforcement learning, we proposed a new framework for data-efficient IRL that leverages the context information of the demonstration set to learn a predictive model of the expert behavior from small amounts of training data. Central to our framework are two model architectures, one designed to learn spatial subgoal plans, the other to capture time-varying intentions. In contrast to the original BNIRL model, both architectures explicitly consider the covariate information contained in the demonstration set, giving rise to predictive models that are inherently robust to demonstration noise. While the original BNIRL model can be recovered as a special case of our framework, the conducted experiments show a drastic improvement over the vanilla BNIRL approach in terms of the achieved subgoal localization accuracy, which stems from both an improved likelihood model and a context-aware clustering of the data. Most notably, our framework outperforms all tested reference methods in the analyzed benchmark scenarios while it additionally captures the full posterior information about the learned subgoal representation. The resulting prediction uncertainty about the expert behavior, reflected by the posterior predictive action distribution, provides a natural basis to apply our method in an active learning setting where the learning system can request additional demonstration data from the expert.

The current limitation of our approach is that both presented architectures require an MDP model with discrete state and action space. While the subgoal principle carries over straightforwardly to continuous metric spaces, the construction of the likelihood model becomes difficult in these environments as it requires knowledge of the optimal state-action value functions for all potential subgoal locations. However, for BNIRL, there exist several ways to approximate the likelihood in these cases (Michini et al., 2013) and the same concepts apply equally to ddBNIRL. Thus, an interesting future study would be to compare the efficacy of both model types on larger problems involving continuous spaces, where it appears even more natural to follow a distance-based approach.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No #713010 (GOALRobots) and No #640554 (SKILLS4ROBOTS).


Appendix A. Robot experiment

Figure 14: Motion sequences without trajectory crossings, which can be represented using a spatial subgoal pattern. Panels: (a) Cycle, (b) Snake, (c) Pentagon. Columns: BNIRL, ground truth, ddBNIRL-T.


Figure 14 (continued): Motion sequences with few trajectory crossings, requiring a time-varying subgoal representation. Panels: (d) Star, (e) Hourglass. Columns: BNIRL, ground truth, ddBNIRL-T.


Figure 14 (continued): Long motion sequences comprising a large number of sub-patterns with overlapping parts that can only be separated by considering the temporal context. Flower (Const): all strokes are performed with the same absolute velocity. Flower (Var): the individual strokes are performed with alternating velocity. Panels: (f) Flower (Const), (g) Flower (Var), (h) Tree. Columns: BNIRL, ground truth, ddBNIRL-T.


References

P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, page 1, 2004.

M. Al-Emran. Hierarchical reinforcement learning: a survey. International Journal of Computing and Digital Systems, 4(2), 2015.

S. V. Albrecht and P. Stone. Autonomous agents modelling other agents: a comprehensive survey and open problems. arXiv:1709.08071 [cs.AI], 2017.

D. J. Aldous. Exchangeability and Related Topics. Springer, 1985.

B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

M. Babes-Vroman, V. Marivate, K. Subramanian, and M. Littman. Apprenticeship learning about multiple intentions. In International Conference on Machine Learning, pages 897–904, 2011.

L. C. Baird. Advantage updating. Technical report, Wright Lab, 1993.

S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11), 2009.

D. M. Blei and P. I. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12(Nov):2461–2488, 2011.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

M. M. Botvinick. Hierarchical reinforcement learning and decision making. Current Opinion in Neurobiology, 22(6):956–962, 2012.

S. J. Bradtke and M. O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems, pages 393–400, 1994.

N. Cesa-Bianchi, C. Gentile, G. Neu, and G. Lugosi. Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pages 6275–6284, 2017.

J. Choi and K.-E. Kim. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In Advances in Neural Information Processing Systems, pages 305–313, 2012.

C. Daniel, H. Van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016a.

C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(1):3190–3239, 2016b.

C. Dimitrakakis and C. A. Rothkopf. Bayesian multitask inverse reinforcement learning. In European Workshop on Reinforcement Learning, pages 273–284, 2011.

N. J. Foti and S. A. Williamson. A survey of non-exchangeable priors for Bayesian nonparametric models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):359–371, 2015.

M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: a survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.

B. M. Kapron, V. King, and B. Mountjoy. Dynamic graph connectivity in polylogarithmic worst case time. In Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1131–1142, 2013.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto. Robot learning from demonstration by constructing skill trees. International Journal of Robotics Research, 31(3):360–375, 2012.

S. Krishnan, A. Garg, R. Liaw, L. Miller, F. T. Pokorny, and K. Goldberg. HIRL: hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv:1604.06508 [cs.RO], 2016.

S. Levine, Z. Popovic, and V. Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19–27, 2011.

R. Lioutikov, G. Neumann, G. Maeda, and J. Peters. Learning movement primitive libraries through probabilistic segmentation. International Journal of Robotics Research, 36(8):879–894, 2017.

M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving Markov decision problems. In Conference on Uncertainty in Artificial Intelligence, pages 394–402, 1995.

B. Michini and J. P. How. Bayesian nonparametric inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163, 2012.

B. Michini, M. Cutler, and J. P. How. Scalable reward learning from demonstration. In IEEE International Conference on Robotics and Automation, pages 303–308, 2013.

B. Michini, T. J. Walsh, A.-A. Agha-Mohammadi, and J. P. How. Bayesian nonparametric reward learning from demonstration. IEEE Transactions on Robotics, 31(2):369–386, 2015.

G. Neu and C. Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007.

A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.

A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, pages 278–287, 1999.

Q. P. Nguyen, B. K. H. Low, and P. Jaillet. Inverse reinforcement learning with locally consistent reward functions. In Advances in Neural Information Processing Systems, pages 1747–1755, 2015.

S. Niekum, S. Osentoski, G. Konidaris, and A. G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5239–5246, 2012.

A. Panella and P. Gmytrasiewicz. Interactive POMDPs with finite-state models of other agents. Autonomous Agents and Multi-Agent Systems, pages 861–904, 2017.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In International Joint Conference on Artificial Intelligence, pages 2586–2591, 2007.

P. Ranchod, B. Rosman, and G. Konidaris. Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 471–477, 2015.

G. O. Roberts and S. K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(2):291–317, 1997.

C. A. Rothkopf and C. Dimitrakakis. Preference elicitation and inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 34–48, 2011.

E. Rueckert, G. Neumann, M. Toussaint, and W. Maass. Learned graphical models for probabilistic planning provide a new class of movement primitives. Frontiers in Computational Neuroscience, 6:97, 2013.

S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Learning movement primitives. Robotics Research, pages 561–572, 2005.

B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2010.

O. Simsek, A. P. Wolfe, and A. G. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In International Conference on Machine Learning, pages 816–823, 2005.

A. Sosic, A. M. Zoubir, and H. Koeppl. Inverse reinforcement learning via nonparametric subgoal modeling. In AAAI Spring Symposium on Data-Efficient Reinforcement Learning, 2018a.

A. Sosic, A. M. Zoubir, and H. Koeppl. A Bayesian approach to policy recognition and state representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1295–1308, 2018b.

M. Stolle and D. Precup. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pages 212–223, 2002.

A. Surana and K. Srivastava. Bayesian nonparametric inverse reinforcement learning for switched Markov decision processes. In IEEE International Conference on Learning and Applications, pages 47–54, 2014.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

M. Tamassia, F. Zambetta, W. Raffe, and X. Li. Learning options for an MDP from demonstrations. In Australasian Conference on Artificial Life and Computational Intelligence, pages 226–242, 2015.

H. M. Taylor and S. Karlin. An Introduction to Stochastic Modeling. Academic Press, 1984.

A. Tewari and P. L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505–1512, 2008.

J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo. Evaluating bag-of-visual-words representations in scene classification. In International Workshop on Multimedia Information Retrieval, pages 197–206, 2007.

S. Zhifei and E. M. Joo. A survey of inverse reinforcement learning techniques. International Journal of Intelligent Computing and Cybernetics, 5(3):293–311, 2012.

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.
