Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning

Jaedeug Choi and Kee-Eung Kim
Department of Computer Science
Korea Advanced Institute of Science and Technology
Daejeon 305-701, Korea
[email protected], [email protected]

Abstract

Most of the algorithms for inverse reinforcement learning (IRL) assume that the reward function is a linear function of the pre-defined state and action features. However, it is often difficult to manually specify the set of features that can make the true reward function representable as a linear function. We propose a Bayesian nonparametric approach to identifying useful composite features for learning the reward function. The composite features are assumed to be the logical conjunctions of the pre-defined atomic features so that we can represent the reward function as a linear function of the composite features. We empirically show that our approach is able to learn composite features that capture important aspects of the reward function on synthetic domains, and predict taxi drivers' behaviour with high accuracy on a real GPS trace dataset.

1 Introduction

Inverse reinforcement learning (IRL) aims to recover the expert's underlying reward function from her demonstrations and the environment model [Russell, 1998]. IRL has been applied to an increasingly wide range of research areas, such as robotics [Argall et al., 2009], computer animation [Lee and Popović, 2010], preference learning [Erkin et al., 2010; Ziebart et al., 2008b], and cognitive science [Baker et al., 2009].

In the last decade, a number of studies on IRL algorithms have appeared in the literature [Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006; Syed et al., 2008; Ziebart et al., 2008a]. Most of them, especially when dealing with large state spaces, assume that the reward function is a linear function of some pre-defined features. The problem then reduces to learning the unknown weights of the linear function, but the result is highly dependent on the selection of features. Hence, it is desirable to automatically construct the features that compactly describe the structure of the reward function [Abbeel and Ng, 2004].

FIRL [Levine et al., 2010] is one of the few IRL algorithms that construct the features for the reward function. It is assumed that the domain expert supplies the set of all potentially relevant features as atomic features, and the algorithm constructs the logical conjunctions of these features for the reward function. GPIRL [Levine et al., 2011] is another IRL algorithm that represents the reward function as a nonlinear function of the atomic features using the Gaussian process (GP). The reward features are implicitly learned as the hyperparameters of the GP kernel.

In this paper, we propose a Bayesian nonparametric approach to constructing the reward function features in IRL. We extend Bayesian IRL [Ramachandran and Amir, 2007] by defining a prior on the composite features, which are defined to be the logical conjunctions of the atomic features. Since the number of composite features is not known a priori, we define the prior using the Indian buffet process (IBP) to infer both the features and the number of features. Our approach has a number of advantages. First, it learns an explicit representation of the reward function features in the form of logical formulas, which are readily interpretable. Second, it can robustly learn from noisy behaviour data since it uses a probabilistic model of the data. Third, it can incorporate domain knowledge on the reward features and their weights into the prior distribution since it is a Bayesian framework.

2 Preliminaries

We use the following notation throughout the paper: $x$ is a column vector with elements $x_i$ and $x_{-i} = x \setminus x_i$. $X$ is a matrix with elements $X_{i,j}$, $X_{:,j}$ is the $j$-th column of $X$, $X_{-(i,j)} = X \setminus X_{i,j}$, and $X_{-i,j} = X_{:,j} \setminus X_{i,j}$.

We denote a discrete-state Markov decision process (MDP) [Puterman, 1994] as a tuple $\mathcal{M} = \langle S, A, \{T^a\}_{a \in A}, r, \gamma \rangle$ where $S$ is the set of states, $A$ is the set of actions, $T^a$ is the state transition probability such that $T^a_{s,s'} = P(s'|s,a)$, $r$ is the (state-dependent) reward function, and $\gamma \in [0, 1)$ is the discount factor.

A policy is defined as a mapping $\pi : S \rightarrow A$. The value of the policy $\pi$ is defined as the expected discounted cumulative reward of executing $\pi$, which satisfies the equation $v^\pi = r + \gamma T^\pi v^\pi$ where $T^\pi_{s,s'} = P(s'|s, \pi(s))$. Similarly, the $Q$-function is defined by the equation $Q^\pi_{:,a} = r + \gamma T^a v^\pi$. Given an MDP $\mathcal{M}$, an optimal policy $\pi^*$ maximizes the value function for all states. The value of an optimal policy satisfies the Bellman optimality equation: $v^* = \max_{a \in A}(r + \gamma T^a v^*)$.
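To make these definitions concrete, the sketch below computes $v^\pi$ by solving the linear system $v^\pi = r + \gamma T^\pi v^\pi$ and $v^*$, $Q^*$ by value iteration. This is only an illustrative NumPy sketch; the array layout (rewards as a vector of length $|S|$, transitions as an $|A| \times |S| \times |S|$ array) and the function names are our own conventions, not the authors' code.

```python
import numpy as np

def policy_value(r, T, policy, gamma):
    """Solve v^pi = r + gamma * T^pi v^pi for a deterministic policy.

    r: (S,) state rewards, T: (A, S, S) transition probabilities,
    policy: (S,) action indices, gamma: discount factor.
    """
    S = len(r)
    T_pi = T[policy, np.arange(S), :]           # row s of T^pi is T^{pi(s)}_{s,.}
    return np.linalg.solve(np.eye(S) - gamma * T_pi, r)

def optimal_values(r, T, gamma, tol=1e-8):
    """Value iteration for v* = max_a (r + gamma * T^a v*); returns (v*, Q*) with Q* of shape (S, A)."""
    v = np.zeros(len(r))
    while True:
        Q = r[None, :] + gamma * (T @ v)        # Q[a, s] = r_s + gamma * sum_s' T^a_{s,s'} v_{s'}
        v_new = Q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, (r[None, :] + gamma * (T @ v_new)).T
        v = v_new
```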

IRL refers to the problem of inferring the reward function of an expert given an MDP\r (an MDP whose reward function is unknown) and the expert's behaviour data $\mathcal{D}$. We assume that $\mathcal{D} = \{\tau_1, \dots, \tau_N\}$ is generated by executing an optimal policy with respect to the unknown reward vector $r$, where the $n$-th trajectory $\tau_n$ is an $H$-step sequence of state-action pairs, i.e., $\tau_n = \{(s_{n,1}, a_{n,1}), \dots, (s_{n,H}, a_{n,H})\}$.

2.1 BIRL: Bayesian Framework for IRL

Ramachandran and Amir [2007] proposed a Bayesian framework for IRL by modeling the compatibility of the reward with the behaviour data as the likelihood and the preference on the rewards as the prior. The likelihood is defined as an independent exponential, or softmax, distribution:

$$P(\mathcal{D}|r, \eta) = \prod_{n=1}^{N} P(\tau_n|r, \eta) = \prod_{n=1}^{N} \prod_{(s,a) \in \tau_n} \frac{\exp(\eta Q^*_{s,a}(r))}{\sum_{a' \in A} \exp(\eta Q^*_{s,a'}(r))} \qquad (1)$$

where $\eta$ is the parameter representing the confidence of actions being optimal, and $Q^*(r)$ is the optimal $Q$-function computed using $r$. Assuming that the reward entries are independently distributed, the prior on the reward function is defined as $P(r) = \prod_{s \in S} P(r_s)$. Various distributions such as the uniform, normal, or Laplace distribution can be used as the prior, depending on the domain knowledge on the reward function. The reward function is then inferred from the posterior distribution, using Markov chain Monte Carlo (MCMC) to estimate the posterior mean [Ramachandran and Amir, 2007] or gradient ascent to estimate the maximum-a-posteriori (MAP) reward [Choi and Kim, 2011].
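As an illustration of Eqn (1), the following sketch evaluates the BIRL log-likelihood given the optimal Q-values of a candidate reward (e.g., from the value-iteration sketch above). The data layout for trajectories is an assumption of this sketch.

```python
import numpy as np

def birl_log_likelihood(trajectories, Q_star, eta):
    """Log of the BIRL likelihood in Eqn (1).

    trajectories: list of trajectories, each a list of (state, action) index pairs,
    Q_star: (S, A) optimal Q-values computed for the candidate reward,
    eta: confidence parameter.
    """
    logZ = np.logaddexp.reduce(eta * Q_star, axis=1)   # log-sum-exp over actions, per state
    ll = 0.0
    for tau in trajectories:
        for s, a in tau:
            ll += eta * Q_star[s, a] - logZ[s]
    return ll
```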

2.2 Indian Buffet Process

The Indian buffet process (IBP) [Ghahramani et al., 2007] defines a distribution over binary matrices with infinitely many columns. The process can be explained by a culinary metaphor: customers enter an Indian buffet one after another and choose dishes from an infinite number of dishes. The first customer chooses a number of dishes distributed as Poisson$(\alpha)$. Thereafter, the $i$-th customer chooses dish $j$ with probability $\zeta_j / i$, where $\zeta_j$ is the number of customers who have previously chosen dish $j$, and additionally chooses a number of new dishes distributed as Poisson$(\alpha/i)$. Let $Z$ be an $M \times K$ binary matrix, where $M$ and $K$ are the total numbers of customers and dishes, whose element $Z_{ij} = 1$ indicates that the $i$-th customer chooses dish $j$. By taking the limit $K \rightarrow \infty$, we obtain the probability distribution induced by the IBP:

$$P_{\text{IBP}}(Z|\alpha) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(\zeta_k + \frac{\alpha}{K})\,\Gamma(M - \zeta_k + 1)}{\Gamma(M + 1 + \frac{\alpha}{K})} \qquad (2)$$

where $\zeta_k = \sum_{m=1}^{M} Z_{m,k}$. The parameter $\alpha$ controls the sparsity of the matrix.
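The culinary metaphor translates directly into a forward sampler for $Z$; a minimal sketch is given below. The column bookkeeping is our own; only the Poisson counts and the $\zeta_j/i$ choice probabilities come from the process described above.

```python
import numpy as np

def sample_ibp(M, alpha, rng=None):
    """Draw a binary matrix Z from the Indian buffet process.

    M: number of customers (rows), alpha: sparsity parameter.
    Returns Z with a random number of non-zero columns.
    """
    rng = rng or np.random.default_rng()
    columns = []                                   # each column holds the 0/1 choices so far
    for i in range(1, M + 1):
        # existing dishes: customer i takes dish j with probability zeta_j / i
        for col in columns:
            zeta = sum(col)
            col.append(1 if rng.random() < zeta / i else 0)
        # new dishes: Poisson(alpha / i) of them, chosen only by customer i
        for _ in range(rng.poisson(alpha / i)):
            columns.append([0] * (i - 1) + [1])
    Z = np.zeros((M, len(columns)), dtype=int)
    for j, col in enumerate(columns):
        Z[:len(col), j] = col
    return Z
```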

3 BNP-FIRL: Bayesian Nonparametric Feature Construction for IRL

We assume that the reward vector is linearly parameterized so that $r = \Phi w$, where $\Phi = [\phi_1, \dots, \phi_K]$ is a reward feature matrix whose $k$-th column $\phi_k$ is an $|S|$-dimensional binary vector representing the $k$-th reward feature, and $w$ is a $K$-dimensional weight vector.

As mentioned in the introduction, most IRL algorithms assume that the reward features are already given, and only compute the weights [Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006; Syed et al., 2008; Ziebart et al., 2008a]. In contrast, the goal of our work is to learn the reward features as well as to compute the weights. A mild assumption made here is that a domain expert provides a set of (binary) atomic features, which includes all the relevant indicators that may possibly influence the rewards in some unknown way. Since the true reward function is hardly ever a linear function of the atomic features, we should learn the reward features as (non-linear but hopefully not too complex) functions of the atomic features, so that we can still represent the reward function as a linear function of the reward features. We refer to the reward features as composite features in order to distinguish them from the atomic features.

In the following sections, we describe our Bayesian nonparametric approach to feature construction in IRL, where the composite features are constructed by taking logical conjunctions of the atomic features. We first define a nonparametric prior over the logical conjunctions of the atomic features using the IBP. We then derive the posterior over the composite features and the reward weights by extending BIRL. Finally, we present an MCMC algorithm for the posterior inference.

3.1 Prior for Feature Construction

Given a matrix of atomic features, defined as $\Psi = [\psi_1, \dots, \psi_M]$ where the $m$-th column $\psi_m$ is an $|S|$-dimensional binary vector representing the $m$-th atomic feature, we define the constructed composite feature matrix as $\Phi = f(x, Z, U; \Psi)$ with

$$\phi_k = \bigwedge_{m \in \{1,\dots,M\}\ \text{s.t.}\ x_m = 1 \wedge Z_{m,k} = 1} \psi_m^{(k)}$$

where $x$ is an $M$-dimensional binary vector indicating the participation of the $m$-th atomic feature in any of the composite features, $Z$ is an $M \times K$ binary matrix indicating the participation of the $m$-th atomic feature in the $k$-th composite feature, and $U$ is also an $M \times K$ binary matrix indicating whether the $m$-th atomic feature is negated in the formula for the $k$-th composite feature, so that $\psi_m^{(k)} = \psi_m$ if $U_{m,k} = 1$ and $\psi_m^{(k)} = \neg\psi_m$ otherwise.
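The construction of $\Phi = f(x, Z, U; \Psi)$ can be read off the definition above. The sketch below is a naive implementation of that conjunction; treating an empty conjunction (an all-zero column of $Z$) as the all-true feature is a convention of this sketch, not something specified in the paper.

```python
import numpy as np

def build_composite_features(x, Z, U, Psi):
    """Construct Phi = f(x, Z, U; Psi) as conjunctions of (possibly negated) atomic features.

    x: (M,) 0/1 participation of atomic features, Z: (M, K) usage matrix,
    U: (M, K) negation indicators (1 keeps psi_m, 0 negates it),
    Psi: (S, M) binary atomic feature matrix. Returns Phi: (S, K).
    """
    S, M = Psi.shape
    K = Z.shape[1]
    Phi = np.ones((S, K), dtype=bool)               # empty conjunction defaults to all-true
    for k in range(K):
        for m in range(M):
            if x[m] == 1 and Z[m, k] == 1:
                psi_m = Psi[:, m].astype(bool)
                Phi[:, k] &= psi_m if U[m, k] == 1 else ~psi_m
    return Phi.astype(int)
```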

We define the prior on $Z$, $U$, and $x$ as follows. First, we start with $Z$, which indicates which atomic features are used for each composite feature. Each composite feature generally consists of more than one exchangeable atomic feature. In addition, we do not know a priori the number $K$ of columns in $Z$, which is the number of composite features to be constructed. Thus, analogous to latent feature modeling, we use the IBP as the prior on $Z$:

$$P(Z|\alpha) = P_{\text{IBP}}(Z|\alpha). \qquad (3)$$

Note that, by using the IBP, we inherently assume that $K$ is unbounded even though it is at most $3^M$. The rationale behind this assumption is that $Z$ can have duplicated columns as well as infinitely many zero columns.

Second, we use the Bernoulli distribution for every entry of the binary matrix $U$, so that the atomic features are negated with probability $p = 0.5$:

$$P(U) = \prod_{m,k} P(U_{m,k}) = \prod_{m,k} P_{\text{Ber}}(U_{m,k}; 0.5). \qquad (4)$$

Finally, we define the prior on the binary vector $x$. Since $x_m = 0$ implies that the atomic feature $\psi_m$ is not used at all in any composite feature, we use a prior that favors $x$ being sparse. Specifically, we use the Bernoulli distribution with $p = \kappa$ where $\kappa$ is Beta distributed with $\beta = [\beta_1, \beta_2]$:

$$P(\kappa|\beta) = P_{\text{Beta}}(\kappa; \beta = [\beta_1, \beta_2])$$
$$P(x_m|\kappa) = P_{\text{Ber}}(x_m; \kappa). \qquad (5)$$

This is analogous to controlling the row-wise sparsity in the IBP proposed by Rai and Daume III [2008].

By combining Eqns (3), (4), and (5), the prior is defined as

$$P(\Phi|\alpha, \beta, \Psi) = P(x|\beta)\, P(Z|\alpha)\, P(U) \qquad (6)$$

where $P(x|\beta) = \int \prod_m P(x_m|\kappa)\, P(\kappa|\beta)\, d\kappa$.

3.2 Posterior Inference

BNP-FIRL extends BIRL by using the prior defined above and treating the behaviour data $\mathcal{D} = \{\tau_1, \dots, \tau_N\}$ as being drawn from the following generative process:

$$\begin{aligned}
\kappa \mid \beta &\sim \text{Beta}(\beta = [\beta_1, \beta_2]) \\
x_m \mid \kappa &\sim \text{Bernoulli}(\kappa) \\
Z \mid \alpha &\sim \text{IBP}(\alpha) \\
U_{m,k} &\sim \text{Bernoulli}(0.5) \\
w_k &\sim P(w_k) \\
\Phi &:= f(x, Z, U; \Psi) \\
r &:= \Phi w \\
\tau_n \mid r, \eta &\sim \prod_{(s,a) \in \tau_n} \frac{\exp(\eta Q^*_{s,a}(r))}{\sum_{a' \in A} \exp(\eta Q^*_{s,a'}(r))}.
\end{aligned}$$
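A forward sample from this generative process can be obtained by composing the earlier sketches (`sample_ibp` and `build_composite_features`); the normal distribution used for $P(w_k)$ below is merely a placeholder, since the paper leaves the weight prior generic.

```python
import numpy as np

def sample_prior(Psi, alpha, beta, rng=None):
    """Draw (x, Z, U, w, Phi, r) from the BNP-FIRL prior (a sketch, not the authors' code)."""
    rng = rng or np.random.default_rng()
    S, M = Psi.shape
    kappa = rng.beta(beta[0], beta[1])              # kappa | beta ~ Beta(beta1, beta2)
    x = (rng.random(M) < kappa).astype(int)         # x_m | kappa ~ Bernoulli(kappa)
    Z = sample_ibp(M, alpha, rng)                   # Z | alpha ~ IBP(alpha)
    K = Z.shape[1]
    U = rng.integers(0, 2, size=(M, K))             # U_{m,k} ~ Bernoulli(0.5)
    w = rng.normal(0.0, 1.0, size=K)                # placeholder for the generic prior P(w_k)
    Phi = build_composite_features(x, Z, U, Psi)    # Phi := f(x, Z, U; Psi)
    r = Phi @ w                                     # r := Phi w
    return x, Z, U, w, Phi, r
```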

Figure 1: Graphical model of BNP-FIRL

Fig 1 shows the graphical model of BNP-FIRL. Note that, as in BIRL, the reward weights $w_k$ are assumed to be independently distributed so that $P(w) = \prod_{k=1}^{K} P(w_k)$.

The posterior over the composite features for the reward function and the associated weights is then formulated as

$$P(\Phi, w \mid \mathcal{D}, \Theta) \propto P(\mathcal{D}|\Phi, w, \eta)\, P(w)\, P(\Phi|\alpha, \beta, \Psi) \qquad (7)$$

where $\Theta = \{\eta, \alpha, \beta, \Psi\}$, $P(\mathcal{D}|\Phi, w, \eta)$ is the BIRL likelihood defined in Eqn (1) with $r = \Phi w$, $P(w)$ is the prior on the weights, and $P(\Phi|\alpha, \beta, \Psi)$ is the prior on the composite features defined in Eqn (6). Note that $\alpha$ controls the number of composite features to be constructed and $\beta$ controls the total number of unique atomic features used in the construction. In other words, $\alpha$ and $\beta$ are the parameters that control the column-wise and the row-wise sparsity of the composite feature matrix $\Phi$, respectively.

Algorithm 1 MCMC algorithm for BNP-FIRL

Initialize x, Z, U, w.
for t = 1 to T do
    for m = 1 to M do
        Sample x_m according to Eqn (8).
    end for
    for m = 1 to M do
        for k = 1 to K do
            Sample U_{m,k} according to Eqn (9).
        end for
    end for
    for m = 1 to M do
        for k = 1 to K do
            if ∑_{i≠m} Z_{i,k} > 0 then
                Sample Z_{m,k} according to Eqn (10).
            end if
        end for
        Propose ξ = 〈K⁺, Z⁺, U⁺, w⁺〉 with K⁺ ∼ Poisson(α/M).
        Accept ξ with probability min{1, ρ_Z}.
    end for
    for k = 1 to K do
        Propose w′ ∼ N(w_k, λ).
        Accept w′ with probability min{1, ρ_w}.
    end for
end for

We infer the composite features and the reward weights from the posterior using an MCMC algorithm (Algorithm 1), described as follows. We first update $x_m$ by sampling from the probability distribution conditioned on the rest of the random variables,

$$P(x_m = 1 \mid \mathcal{D}, x_{-m}, Z, U, w, \Theta) \propto P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta)\,\Big(\beta_1 + \sum_{i \neq m} x_i\Big)$$
$$P(x_m = 0 \mid \mathcal{D}, x_{-m}, Z, U, w, \Theta) \propto P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta)\,\Big(\beta_2 + M - \sum_{i \neq m} x_i\Big). \qquad (8)$$

Note that, instead of drawing the Bernoulli parameter $\kappa$ from Beta$(\beta = [\beta_1, \beta_2])$ and then drawing $x_m$, we have collapsed $\kappa$ out for efficient inference.
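A literal implementation of this collapsed Gibbs step for $x$ might look as follows. The helper `solve_mdp`, which maps a reward vector to its optimal Q-values, is assumed (e.g., the value-iteration sketch from Section 2), `build_composite_features` and `birl_log_likelihood` are the earlier sketches, and the unnormalized weights follow Eqn (8) as printed.

```python
import numpy as np

def resample_x(x, Z, U, w, Psi, trajectories, eta, beta, solve_mdp, rng=None):
    """One Gibbs sweep over x following Eqn (8), with kappa collapsed out.

    x: (M,) binary vector (updated in place), beta: [beta1, beta2].
    solve_mdp maps a reward vector to optimal Q-values of shape (S, A).
    """
    rng = rng or np.random.default_rng()
    M = len(x)
    for m in range(M):
        log_w = np.empty(2)
        for value in (0, 1):
            x[m] = value
            Phi = build_composite_features(x, Z, U, Psi)
            Q = solve_mdp(Phi @ w)                     # optimal Q* for the candidate reward
            others = x.sum() - value                   # sum_{i != m} x_i
            prior = beta[0] + others if value == 1 else beta[1] + M - others
            log_w[value] = birl_log_likelihood(trajectories, Q, eta) + np.log(prior)
        p_one = 1.0 / (1.0 + np.exp(log_w[0] - log_w[1]))
        x[m] = int(rng.random() < p_one)
    return x
```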

Next, in order to sample $U_{m,k}$, which indicates whether we should negate the atomic feature $\psi_m$ in the composite feature $\phi_k$, we sample it from the likelihood

$$P(U_{m,k} \mid \mathcal{D}, x, Z, U_{-(m,k)}, w, \Theta) \propto P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta) \qquad (9)$$

since the prior on $U_{m,k}$ is uniformly random over $\{0, 1\}$.

We then sample $Z_{m,k}$, which indicates whether the atomic feature $\psi_m$ appears in the composite feature $\phi_k$. For the sake of exposition, we rephrase the IBP culinary metaphor in our context, with the atomic features being the customers and the composite features being the dishes. The first atomic feature chooses Poisson$(\alpha)$ composite features. The $m$-th atomic feature chooses the $k$-th composite feature (already chosen by preceding atomic features) with probability $\sum_{i=1}^{m-1} Z_{i,k} / m$ and additionally chooses Poisson$(\alpha/m)$ new composite features, where $Z_{i,k} = 1$ indicates that the $i$-th atomic feature $\psi_i$ has chosen the $k$-th composite feature $\phi_k$.


This metaphor and the exchangeability of the atomic features lead to the update method for $Z_{m,:}$ [Rai and Daume III, 2008] as follows. First, for the composite features that are already chosen (i.e., $\sum_{i \neq m} Z_{i,k} > 0$), we update $Z_{m,k}$ according to the conditional distribution

$$P(Z_{m,k} = 1 \mid \mathcal{D}, x, Z_{-m,k}, U, w, \Theta) \propto P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta) \sum_{i \neq m} Z_{i,k}$$
$$P(Z_{m,k} = 0 \mid \mathcal{D}, x, Z_{-m,k}, U, w, \Theta) \propto P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta)\,\Big(M - \sum_{i \neq m} Z_{i,k}\Big). \qquad (10)$$

Second, for choosing the new composite features, we sample $Z_{m,k}$ using a Metropolis-Hastings (MH) update: we first sample $\xi = \langle K^+, Z^+, U^+, w^+ \rangle$ where $K^+ \sim \text{Poisson}(\alpha/m)$, $Z^+$ is the $M \times K^+$ binary matrix whose entries in the $m$-th row are set to 1 and all others to 0, and the $M \times K^+$ binary matrix $U^+$ and the $K^+$-dimensional vector $w^+$ are drawn from the corresponding priors. We then accept $\xi$ with probability $\min\{1, \rho_Z\}$ where

$$\rho_Z = \frac{P(\mathcal{D} \mid f(x, [Z, Z^+], [U, U^+]; \Psi), [w; w^+], \eta)}{P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta)}. \qquad (11)$$

In the above equation, $[X, Y]$ denotes horizontal concatenation, and $[x; y]$ denotes vertical concatenation.

Finally, we sample the weights $w_k$, again using an MH update. We first sample $w' \sim \mathcal{N}(w_k, \lambda)$ and then accept it with probability $\min\{1, \rho_w\}$ where

$$\rho_w = \frac{P(\mathcal{D} \mid f(x, Z, U; \Psi), w^{\text{new}}, \eta)\, P(w')}{P(\mathcal{D} \mid f(x, Z, U; \Psi), w, \eta)\, P(w_k)} \qquad (12)$$

with $w^{\text{new}}$ formed by taking $w^{\text{new}}_k = w'$ and $w^{\text{new}}_{-k} = w_{-k}$.
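The per-weight MH step of Eqn (12) can be sketched as follows, again assuming a `solve_mdp` helper for the optimal Q-values and a user-supplied `log_prior_w` for $\log P(w_k)$.

```python
import numpy as np

def resample_weights(w, Phi, trajectories, eta, lam, log_prior_w, solve_mdp, rng=None):
    """One Metropolis-Hastings sweep over the weights following Eqn (12).

    lam is the proposal scale lambda; log_prior_w(w_k) returns log P(w_k).
    """
    rng = rng or np.random.default_rng()
    w = w.copy()
    log_lik = birl_log_likelihood(trajectories, solve_mdp(Phi @ w), eta)
    for k in range(len(w)):
        w_prop = w.copy()
        w_prop[k] = rng.normal(w[k], lam)              # w' ~ N(w_k, lambda)
        log_lik_prop = birl_log_likelihood(trajectories, solve_mdp(Phi @ w_prop), eta)
        log_rho = (log_lik_prop + log_prior_w(w_prop[k])) - (log_lik + log_prior_w(w[k]))
        if np.log(rng.random()) < log_rho:             # accept with probability min{1, rho_w}
            w, log_lik = w_prop, log_lik_prop
    return w
```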

The posterior mean is commonly used for inferring the reward function since it is known to minimize the squared error $L_{\text{SE}}(\hat{r}, r) = \|\hat{r} - r\|^2$ [Ramachandran and Amir, 2007]. We thus estimate the posterior mean of the reward function. On the other hand, when we show the learned composite features, we choose the sample with the maximum posterior, i.e., $\langle \Phi^{\text{MAP}}, w^{\text{MAP}} \rangle = \operatorname{argmax}_{\langle \Phi^{(t)}, w^{(t)} \rangle} P(\Phi^{(t)}, w^{(t)} \mid \mathcal{D}, \Theta)$ and $r^{\text{MAP}} = \Phi^{\text{MAP}} w^{\text{MAP}}$. This is because the sample mean is ill-defined for $\Phi^{(t)}$'s with different dimensions.

4 Experimental Results

In this section, we show the performance of BNP-FIRL in a number of problem domains and compare it to FIRL [Levine et al., 2010] and GPIRL [Levine et al., 2011] (code available at http://graphics.stanford.edu/projects/gpirl). The performance was evaluated using the expected value difference (EVD) $\frac{1}{|S|} \|v^*(r) - v^{\pi(r')}(r)\|_1$, where $r$ is the expert's reward function (i.e., the ground truth), $r'$ is the learned reward function, and $\pi(r')$ is the optimal policy computed using $r'$. The EVD can thus be seen as a measure of the loss in optimality incurred by using the policy from the learned reward function instead of the expert's reward function. In order to evaluate how well the composite features are learned, we computed the EVD on the same problem instance where the expert's behaviour data was generated (i.e., the training problem instance), and on additional random problem instances (i.e., transfer problem instances) as well.
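For reference, the EVD can be computed directly from the MDP sketches of Section 2; the version below assumes state-dependent rewards and a deterministic greedy policy $\pi(r')$.

```python
import numpy as np

def expected_value_difference(r_true, r_learned, T, gamma):
    """EVD = (1/|S|) * || v*(r) - v^{pi(r')}(r) ||_1, using the earlier MDP sketches."""
    v_star, _ = optimal_values(r_true, T, gamma)             # v*(r)
    _, Q_learned = optimal_values(r_learned, T, gamma)
    policy = Q_learned.argmax(axis=1)                        # pi(r'): greedy w.r.t. learned reward
    v_pi = policy_value(r_true, T, policy, gamma)            # value of pi(r') under the true reward
    return np.abs(v_star - v_pi).sum() / len(r_true)
```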

4.1 Objectworld Domain

The first set of experiments was performed on the objectworld domain [Levine et al., 2011], where the agent can move north, south, east, or west, or stay in the current location, in an $N \times N$ grid. Each action has a failure probability of 0.3, which makes the agent move in a random direction. There are a number of colored objects randomly placed in the grid, each with one of $C \geq 2$ colors.

We prepared a total of $CN$ atomic features of the form "$d(s, c_i) < j$", indicating that the distance between the agent's location $s$ and the nearest object of color $i$ is less than $j$, where $i \in \{1, \dots, C\}$ and $j \in \{1, \dots, N\}$. The true reward function was set to

$$r_s = \begin{cases} 1 & \text{if } d(s, c_1) < 3 \wedge d(s, c_2) < 2, \\ -2 & \text{if } d(s, c_1) < 3 \wedge \neg(d(s, c_2) < 2), \\ 0 & \text{otherwise.} \end{cases}$$
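A rough sketch of this setup is shown below. The grid enumeration and the use of Euclidean distance for $d(s, c_i)$ are assumptions of the sketch; the paper does not spell out these details.

```python
import numpy as np

def objectworld_features(N, C, object_cells, object_colors):
    """Atomic features "d(s, c_i) < j" and the true reward for an N x N objectworld.

    object_cells: list of (row, col) grid coordinates, object_colors: color index in {0,...,C-1}.
    States are enumerated row-major; distances are Euclidean (an assumption of this sketch).
    """
    coords = np.array([(i, j) for i in range(N) for j in range(N)], dtype=float)
    # d[s, c] = distance from state s to the nearest object of color c
    d = np.full((N * N, C), np.inf)
    for cell, color in zip(object_cells, object_colors):
        d[:, color] = np.minimum(d[:, color], np.linalg.norm(coords - np.asarray(cell), axis=1))
    # one atomic feature per (color c, threshold j): Psi column is 1 iff d(s, c) < j
    Psi = np.stack([(d[:, c] < j) for c in range(C) for j in range(1, N + 1)], axis=1).astype(int)
    # true reward: +1 if d(s,c1)<3 and d(s,c2)<2, -2 if d(s,c1)<3 and not d(s,c2)<2, else 0
    near1, near2 = d[:, 0] < 3, d[:, 1] < 2
    r = np.where(near1 & near2, 1.0, np.where(near1 & ~near2, -2.0, 0.0))
    return Psi, r
```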

We generated 10 random training problem instances by sampling the locations of objects and their colors, and gathered trajectories of length 20. In order to measure how well the learned reward function generalizes to novel yet similar problem instances, we measured the EVD on an additional 10 transfer problem instances for each training problem instance, generated in the same way we prepared the training problem instances.

The top row in Fig 2 shows the EVD performance when the trajectories are generated by the optimal policy. BNP-FIRL outperformed FIRL in these experiments. This is mainly due to the fact that BNP-FIRL is allowed to have multiple composite features for each state, whereas FIRL can only have one because of the way the algorithm constructs the composite features. Tbl 1 shows the 4 reward features and the corresponding weights found by BNP-FIRL; in comparison, FIRL produced a total of 21 features. On the other hand, BNP-FIRL was on par in performance with GPIRL. Nonetheless, one of the advantages of BNP-FIRL is that the learned features are explicitly represented and thus readily interpretable.

Table 1: Learned features for the 32 × 32 objectworld domain.

       Weights   Reward features
  φ1     8.61    d(s, c1) < 3 ∧ d(s, c2) < 2
  φ2   -11.53    d(s, c1) < 3 ∧ ¬(d(s, c2) < 2)
  φ3    -7.29    d(s, c1) < 3 ∧ ¬(d(s, c2) < 2) ∧ d(s, c3) < 9
  φ4     0.51    ¬(d(s, c3) < 9)

The bottom row in Fig 2 shows the EVD performance when the trajectories are generated from an ε-greedy policy. It is interesting to observe that BNP-FIRL outperforms GPIRL and FIRL. We conjecture that this is due to the likelihood in BNP-FIRL being particularly robust to noisy behaviour data.

Figure 2: Averages and standard errors of the EVD on 10 random instances of the 32 × 32 objectworld domain, with Top row: trajectories generated by the optimal policy, and Bottom row: trajectories generated by an ε-greedy policy (ε = 0.2). Left two columns: fix C = 4 (the number of colors) and vary |D| (the number of trajectories). Right two columns: vary C and fix |D| = 50.

4.2 Simulated-highway Domain

The second set of experiments was done on the simulated-highway domain [Levine et al., 2011], where the agent drives a vehicle by moving one lane left or right at speeds {1, 2, 3, 4} on a three-lane highway. The actions that change the lane or speed fail with probability 0.3. There are other vehicles driving at speed 1, each of them being either a civilian or a police vehicle and either a car or a motorcycle (a total of 4 categories). We prepared an expert that prefers to drive as fast as possible but avoids driving at speed 3 or 4 within a distance of 2 from a police vehicle. We also prepared 3 types of atomic features: the first indicate the current speed (4 features), the second indicate the current lane (3 features), and the third indicate the distance to the nearest car from each category, of the form "d(s, c) ≤ j" where c ∈ {police, civilian} × {car, motorcycle} and j ∈ {0, . . . , 5} (24 features). We generated the trajectory data by executing the optimal policy for 200 time steps.

Fig 3 shows that BNP-FIRL again performs better than FIRL. On the other hand, BNP-FIRL performs slightly worse than GPIRL although, as shown in Fig 4, the learned reward functions from the two algorithms were very similar to the true one. In contrast, FIRL was not able to capture good features for the reward function and, as a result, its learned reward function is very different from the true one.

Figure 3: Averages and standard errors of the performance results over 10 instances of the highway domain.

4.3 Taxi Driver Behaviour Prediction

The final set of experiments was done on learning taxi drivers' preferences using GPS trace data collected in San Francisco [Piorkowski et al., 2009]. We modeled the road network as a graph consisting of 20531 road segments obtained from OPENSTREETMAP (http://www.openstreetmap.org). As in Ziebart et al. [2008b], we assumed that the taxi drivers try to reach a destination in a trip by taking road segments according to their preferences. Note that each trip typically has a different destination. We used a goal-oriented MDP $\mathcal{M}_g$ to define the environment model, where the states are road segments and the state transitions correspond to taking one of the road segments at intersections. The rewards were assumed to be negative everywhere except at the destination, which was represented as an absorbing goal state $g$ with zero reward.

The composite features $\phi_{(s,a),k}$ and the weights $w_k$ were assumed to be independent of the destination. In other words, given the destination $g$, the reward was calculated as $r_g = F(g) \cdot w$ where $w_k < 0$, $F_{(g,a),k}(g) = 0$, and $F_{(s,a),k}(g) = \phi_{(s,a),k}$ for all $s \in S \setminus \{g\}$. This change leads to a slight modification of the likelihood so that $P(\mathcal{D}|\Phi, w, \eta) = \prod_n P(\tau_n \mid r_{g_n}, \eta; \mathcal{M}_{g_n})$ where $g_n$ is the destination of the trajectory $\tau_n$. Note that, although we have a different MDP $\mathcal{M}_{g_n}$ for each trajectory due to the different destinations, they all share the same reward features $\Phi$ and weights $w$.
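The destination-dependent reward $r_g = F(g) \cdot w$ and the per-trip likelihood can be sketched as follows; `solve_goal_mdp`, which returns the optimal Q-values of the goal-oriented MDP $\mathcal{M}_g$, is assumed, and `birl_log_likelihood` is the earlier sketch.

```python
import numpy as np

def goal_reward(Phi_sa, w, g):
    """Build r_g = F(g) . w for the goal-oriented MDP M_g (a sketch of the construction above).

    Phi_sa: (S, A, K) state-action composite features, w: (K,) negative weights,
    g: index of the destination (absorbing goal) state.
    """
    F = Phi_sa.astype(float)           # copy of the features
    F[g, :, :] = 0.0                   # F_{(g,a),k}(g) = 0, so the goal has zero reward
    return F @ w                       # r_g[s, a] = sum_k phi_{(s,a),k} w_k for s != g

def trips_log_likelihood(trips, Phi_sa, w, eta, solve_goal_mdp):
    """log P(D | Phi, w, eta) = sum_n log P(tau_n | r_{g_n}, eta; M_{g_n}).

    trips: list of (trajectory, destination) pairs; solve_goal_mdp(r, g) is assumed to
    return the optimal Q-values (S, A) of M_g.
    """
    return sum(birl_log_likelihood([tau], solve_goal_mdp(goal_reward(Phi_sa, w, g), g), eta)
               for tau, g in trips)
```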

We prepared state-dependent atomic features representing the properties of the road segments: the type (highway, primary street, secondary street, living street), the speed limit (below 20 mph, 20-30 mph, 30-40 mph, above 40 mph), the number of lanes (1, 2, 3+), and whether the segment is one-way. These features were obtained using OPENSTREETMAP. We also prepared action-dependent atomic features representing the angle of the turn (hard left, soft left, straight, soft right, hard right, u-turn).

Among the traces of 500 taxis in the original data, we selected 500 trips from 10 taxis, visualized in Fig 5.


Figure 4: Rewards for the states at each speed for the highway domain. The numbers below the figures present the speed and the brightness of each state represents the reward (brighter = higher).

Figure 5: GPS traces of 500 trips from 10 taxis collected in San Francisco.

The accumulated distance of the trips was 1137 miles. We then segmented the GPS traces and mapped them to the road segments using a preprocessing algorithm described in Lou et al. [2009].

We compared BNP-FIRL to two baseline methods and two IRL algorithms from the literature. As the first baseline, we used the shortest path to the destination. As the second baseline, we used the optimal policy from an MDP with a heuristically set reward weight $w'$: a simple heuristic is to count the feature visitations in the trace data and set the reward weights accordingly. Since we enforce the reward weights to be negative, we used the normalized counts of feature visitations offset by the maximum count value. As for the two IRL algorithms, we chose maximum entropy IRL (MaxEntIRL) [Ziebart et al., 2008a] and gradient ascent BIRL (MAP-BIRL) [Choi and Kim, 2011]. Since both algorithms depend on a pre-defined set of features, we supplied the set of all atomic features. We do not report results for FIRL and GPIRL since it was not straightforward to modify them to appropriately handle multiple MDPs sharing the same reward features and weights.

Tbl 2 shows the average prediction accuracies and their standard errors. The prediction accuracy was measured in terms of turn and route predictions. For each trip, we computed the path from the origin to the destination using each algorithm. The correct actions taken at the intersections were counted for the turn prediction accuracy, and the ratio of the total distance covered by the correct road segments was calculated for the route prediction accuracy. All the results were obtained by 5-fold cross validation. Tbl 3 presents the weights and composite features learned by BNP-FIRL, which are fairly intuitive. For example, making hard left turns (φ1) is avoided with the highest penalty. Making turns on a highway is also highly undesirable, except for making a soft right turn to take an exit ramp (φ2). It was also found that taxi drivers prefer to take a primary street (φ9) or a road with a 20-30 mph limit (φ8), rather than the highway (φ5) or a road with a speed limit above 40 mph (φ3). This is because the trips were generally short, the average distance being 2.27 miles.

Table 2: Driver behaviour prediction results.

                      Turn prediction (%)   Route prediction (%)
  Shortest path         77.06 (±0.14)         32.91 (±0.21)
  MDP (w′)              83.79 (±0.21)         43.42 (±0.23)
  MaxEntIRL             80.27 (±0.67)         42.80 (±1.07)
  MAP-BIRL              84.97 (±0.58)         46.87 (±0.92)
  BNP-FIRL (MAP)        86.22 (±0.24)         48.42 (±0.54)
  BNP-FIRL (Mean)       86.28 (±0.18)         48.70 (±0.70)

Table 3: Learned features for driver behaviour prediction.

       Weights   Reward features
  φ1    -2.40    hard left turn
  φ2    -1.98    highway ∧ ¬(secondary street) ∧ ¬(below 20 mph) ∧ ¬(soft right turn) ∧ ¬(straight)
  φ3    -1.38    above 40 mph
  φ4    -0.41    ¬(highway) ∧ ¬(oneway)
  φ5    -0.27    highway
  φ6    -0.24    ¬(straight)
  φ7    -0.21    ¬(below 20 mph) ∧ ¬(2 lanes) ∧ ¬(oneway)
  φ8    -0.12    20-30 mph
  φ9    -0.11    primary street

5 Conclusion

We presented BNP-FIRL, a Bayesian nonparametric approach to constructing reward features in IRL. We defined the reward features as logical conjunctions of the atomic features provided by the domain expert, and formulated a nonparametric prior over the reward features using the IBP. We derived an MCMC algorithm to carry out posterior inference on the reward features and the corresponding reward weights.

We showed that BNP-FIRL outperforms or performs on par with prior feature-learning IRL algorithms through experiments on synthetic domains. In comparison to FIRL, BNP-FIRL produces more succinct sets of features with richer representations and learns better reward functions. In comparison to GPIRL, BNP-FIRL produces sets of features that are explicitly represented and readily interpretable. We also presented reward feature learning results on real GPS trace data collected from taxi drivers, predicting their behaviour with higher accuracy than prior IRL algorithms.

Acknowledgments

This work was supported by the National Research Foundation of Korea (Grant# 2012-007881), the Defense Acquisition Program Administration and Agency for Defense Development of Korea (Contract# UD080042AD), and the IT R&D program of MKE/KEIT (Contract# 10041678).


References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.

[Argall et al., 2009] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[Baker et al., 2009] Chris L. Baker, Rebecca Saxe, and Joshua B. Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.

[Choi and Kim, 2011] Jaedeug Choi and Kee-Eung Kim. MAP inference for Bayesian inverse reinforcement learning. In Advances in Neural Information Processing Systems 24, 2011.

[Erkin et al., 2010] Zeynep Erkin, Matthew D. Bailey, Lisa M. Maillart, Andrew J. Schaefer, and Mark S. Roberts. Eliciting patients' revealed preferences: An inverse Markov decision process approach. Decision Analysis, 7(4):358–365, 2010.

[Ghahramani et al., 2007] Zoubin Ghahramani, Thomas L. Griffiths, and Peter Sollich. Bayesian nonparametric latent feature models. Bayesian Statistics, 8:1–25, 2007.

[Lee and Popović, 2010] Seong Jae Lee and Zoran Popović. Learning behavior styles with inverse reinforcement learning. ACM Transactions on Graphics, 29:1–7, 2010.

[Levine et al., 2010] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems 23, 2010.

[Levine et al., 2011] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems 24, 2011.

[Lou et al., 2009] Yin Lou, Chengyang Zhang, Yu Zheng, Xing Xie, Wei Wang, and Yan Huang. Map-matching for low-sampling-rate GPS trajectories. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2009.

[Ng and Russell, 2000] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.

[Piorkowski et al., 2009] Michal Piorkowski, Natasa Sarafijanovic-Djukic, and Matthias Grossglauser. A parsimonious model of mobile partitioned networks with clustering. In Proceedings of the 1st International Conference on Communication Systems and Networks and Workshops, 2009.

[Puterman, 1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

[Rai and Daume III, 2008] Piyush Rai and Hal Daume III. The infinite hierarchical factor regression model. In Advances in Neural Information Processing Systems 21, 2008.

[Ramachandran and Amir, 2007] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

[Ratliff et al., 2006] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[Russell, 1998] Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.

[Syed et al., 2008] Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[Ziebart et al., 2008a] Brian D. Ziebart, Andrew Maas, James Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008.

[Ziebart et al., 2008b] Brian D. Ziebart, Andrew L. Maas, Anind K. Dey, and J. Andrew Bagnell. Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior. In Proceedings of the 10th International Conference on Ubiquitous Computing, 2008.
