
Exploration from Demonstration for Interactive Reinforcement Learning

Kaushik Subramanian, College of Computing, Georgia Tech, Atlanta, GA 30332, [email protected]
Charles L. Isbell Jr., College of Computing, Georgia Tech, Atlanta, GA 30332, [email protected]
Andrea L. Thomaz, Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78701, [email protected]

ABSTRACT

Reinforcement Learning (RL) has been effectively used to solve complex problems given careful design of the problem and algorithm parameters. However, standard RL approaches do not scale particularly well with the size of the problem and often require extensive engineering on the part of the designer to minimize the search space. To alleviate this problem, we present a model-free policy-based approach called Exploration from Demonstration (EfD) that uses human demonstrations to guide search space exploration. We use statistical measures of RL algorithms to provide feedback to the user about the agent's uncertainty and use this to solicit targeted demonstrations useful from the agent's perspective. The demonstrations are used to learn an exploration policy that actively guides the agent towards important aspects of the problem. We instantiate our approach in a gridworld and a popular arcade game and validate its performance under different experimental conditions. We show how EfD scales to large problems and provides convergence speed-ups over traditional exploration and interactive learning methods.

1. INTRODUCTION

Reinforcement Learning (RL) [36] is the field of research focused on solving sequential decision-making tasks modeled as Markov Decision Processes. Researchers have shown RL to be successful at solving a variety of problems such as computer games (Backgammon [39], Atari games [26]), robot tasks (soccer [34], helicopter control [1]) and system operations (inventory management [16]); however, in general, standard RL approaches do not scale well with the size of the problem and often require extensive engineering on the part of the designer to minimize the search space. This problem arises because RL approaches rely on obtaining samples useful for learning the underlying structure without always using smart methods to explore the problem. In the RL literature this is known as balancing the exploration-exploitation tradeoff, a central problem in Reinforcement Learning. Existing methods use either a fixed (uniformly random) policy or value-based metrics [6, 35], which result in redundant exploration and/or have prohibitively expensive sample complexity. In this work we tackle the problem of smart exploration in RL by using human interaction. We present a model-free policy-based approach called Exploration from Demonstration (EfD) that serves to bias an RL agent's exploration to cover the search space effectively.

In designing an exploration policy for sequential decision-making problems, it is important to help the agent reach parts of the search space that are necessary to model in order to solve the problem. These include, for example, guiding the agent towards stochastic regions of the problem or regions of high reward. To facilitate exploration in these parts of the domain, we utilize statistical measures of the learning algorithm. We pose the optimization function as a linear regression problem and use relevant measures [27] to help identify influential regions. Such an approach has the advantage of learning to explore the problem from the agent's perspective while taking into account the underlying representation, the RL algorithm as well as the problem definition. We solicit interaction from an oracle (human or simulated) to acquire demonstrations that lead the agent towards these influential regions. We instantiate our approach on a gridworld and a popular arcade game (Frogger) and show how our policy-based approach is able to solve long horizon problems using exploratory demonstrations while outperforming traditional exploration and interactive learning methods.

2. RELATED WORK

There has been extensive work related to exploration in RL. With regard to the approach presented in this paper, we broadly categorize the existing work into two areas: automatic exploration in RL and interactive machine learning. We present details on the related work in these fields to provide context for our approach.

2.1 Automatic Exploration in RL

R-max [31] is an automatic approach that performs exploration using the idea of optimism in the face of uncertainty. The approach performs well in practice; however, R-max scales exponentially in the number of state variables, which makes it intractable for sufficiently large problems. There are several value-based methods such as UCB [6] and others [24, 7, 11], as well as model-based methods [35, 29, 18, 19], studied in the context of multi-armed bandits, that perform effective exploration by maintaining statistics about changes in the value function and the number of times state-action pairs have been visited. While successful in smaller domains, these approaches run into sample complexity issues when dealing with the high-dimensional, long-horizon domains we aim to solve.

A smart exploration method was proposed by Gehring and Precup [14], which uses the residual (TD error) as a reward for a Q-function whose implied policy is then used for exploration. The intuition behind this approach is that state-action pairs that have high residuals should be visited and tried more often. This method relies on stable estimates of the residual. We adapt a version of this approach in our experiments. Using an internal reward function [9] is an interesting approach to exploration in RL. In this method, a user is required to design a reward function for skill learning, which is often non-trivial. While the authors provide empirical results only on small domains, the idea presented is promising and warrants further exploration. An approach closely related to the work presented in this paper is that of Active RL [13], where the authors model the problem as a POMDP. They model the sensitivities of the policy to the unknown transition and reward function and build exploration strategies focusing on these aspects of the problem. This method relies on using Newton's method to solve the problem until convergence (which cannot be guaranteed), and they need to solve the MDP numerous times to test the sensitivity. The work most closely related to ours is that of Akiyama et al. [3], who leverage concepts from least squares approaches to guide exploration in policy iteration methods. The main difference is that the approach they design requires the problems to have certain properties with respect to the reward distribution and as such is not directly comparable.

2.2 Interactive Machine Learning

There is a wide variety of work in the field of Interactive Machine Learning, namely Learning by Demonstration [4], Imitation Learning [30], Policy Shaping [17] and TAMER [22]. Work by Knox and Stone [22, 23] has shown that a general improvement to learning from human feedback is possible if it is used to directly modify the action selection mechanism of the Reinforcement Learning algorithm. Although both approaches use human feedback to modify an agent's exploration policy, they still treat human feedback as either a reward or a value. In our work, we use human interaction to directly learn a policy.

Active Reward Learning [10] has been used to learn a reward function from human feedback and use it in an RL algorithm. The human provides input on task executions, a score for each execution, which is then smoothed using Gaussian Processes and Bayesian Optimization. Reward function design in general is known to be a hard problem, as there are always possibilities of loops in the learned policy. A paper similar in theme to this work is one on active imitation learning by state queries [20]. The authors present an approach where the human interacts with the agent by giving an optimal action in a specific state or by saying that the state is bad. The query states are chosen by a query-by-committee approach based on Bayesian Active Learning. In their approach, they assume the learner has access to a simulator of the MDP and do not explicitly handle the case where humans provide a bad state response.

In other works, rather than having the human input be a reward shaping input, the human provides demonstrations of the optimal policy. Several papers have shown how the policy information in human demonstrations can be used for inverse optimal control [28, 2], for teaching [8], to seed an agent's exploration [5, 38], and in some cases entirely in place of exploration [21, 33]. DAgger [32] is a no-regret online learning approach for supervised learning. The method learns from training data and then executes the learned model. For every mistake made, the human demonstrator provides more examples in that space. These examples are appended to the training set and the learner is retrained. It is not strictly an RL approach, as it mainly provides examples useful for a supervised learner. We note that all of the methods presented here require or assume that the human provides optimal information, which is not necessary in our approach.

3. BACKGROUND AND PRELIMINARIES

Reinforcement Learning (RL) defines a class of algorithms for solving problems modeled as a Markov Decision Process (MDP). An MDP can be represented as a tuple M = ⟨S, A, T, R, γ⟩ with states S, actions A, transition function T : S × A → Pr[S], reward function R : S × A → [Rmin, Rmax], and discount factor γ ∈ [0, 1]. A policy π : S → Pr[A] defines the probability of selecting an action in a state. For a given MDP, a Q-function Qπ(s, a) represents the expected long-term reward of taking action a in state s and following policy π thereafter. Mathematically, the Q-function is computed as

Qπ(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') Σ_{a'} π(s', a') Qπ(s', a')    (1)

The optimal action-value function is Q*(s, a) = max_π Qπ(s, a), and the solution to an MDP is any optimal policy π*, which maximizes the expected reward for every state.

In this paper we use Q-learning with linear function approximation [37]. The Q-function is represented as a linear function of state-action features, Q(s, a) = φ(s, a)ᵀθ. Assuming φ(s) are the features used to represent the states in an MDP, we obtain φ(s, a) by duplicating the features φ(s) for all actions and activating only the block for the action under consideration. For example, if |A| = 4, then φ(s, 3) = [0, 0, φ(s), 0]ᵀ. A set of transitions made by the RL agent from a starting state to any terminal state comprises a single episode. For every transition made by the agent of the form ⟨s, a, r, s'⟩, where r is the reward obtained for taking action a in state s and transitioning to s', the Q-learning algorithm updates its weight vector θ using first-order gradient methods in the following manner:

δ = r + γ max_{a'∈A} φ(s', a')ᵀθ − φ(s, a)ᵀθ
θ ← θ + α δ φ(s, a)    (2)

Here α ∈ [0, 1] is the learning rate. The error δ is the Temporal Difference (TD) error (also loosely referred to as the Bellman error). The algorithm, while not guaranteed to converge in the general case, performs well in practice.
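As a concrete illustration, the following is a minimal sketch of this update for the duplicated-feature representation described above, assuming NumPy arrays and a small discrete action set; the helper names (phi, q_update) are ours, not the paper's.

```python
import numpy as np

def phi(state_features: np.ndarray, action: int, n_actions: int) -> np.ndarray:
    """Duplicate the state features per action and activate only the chosen action's block."""
    k = state_features.shape[0]
    feat = np.zeros(k * n_actions)
    feat[action * k:(action + 1) * k] = state_features
    return feat

def q_update(theta, s_feat, a, r, s2_feat, n_actions, gamma=0.99, alpha=0.01, terminal=False):
    """One gradient-style Q-learning update (Equation 2) for Q(s, a) = phi(s, a)^T theta."""
    q_sa = phi(s_feat, a, n_actions) @ theta
    q_next = 0.0 if terminal else max(phi(s2_feat, a2, n_actions) @ theta
                                      for a2 in range(n_actions))
    delta = r + gamma * q_next - q_sa            # TD (Bellman) error
    theta = theta + alpha * delta * phi(s_feat, a, n_actions)
    return theta, delta
```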

Exploration in RL is commonly achieved using two methods: ε-greedy and softmax action selection. In ε-greedy action selection, exploration is performed by uniformly sampling a random action with probability ε and selecting the current best action (greedy action selection) with probability 1 − ε. The choice of ε is left to the designer. In some cases, a decay schedule is used where the value of ε is decayed over time as the agent gains more domain experience. Softmax action selection takes a more informed approach to exploration. Instead of sampling a random action, the actions are weighted according to their respective Q-value estimates and sampled from the resulting distribution. The most common implementations use a Boltzmann distribution of the following form:

πB(a | s) = e^{Q(s,a)/τ} / Σ_{a'∈A} e^{Q(s,a')/τ}

where τ is a positive temperature parameter. A higher temperature results in a more uniform distribution, while a lower value results in greedy action selection.
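A minimal sketch of Boltzmann action selection, assuming a NumPy vector of Q-value estimates for the current state (the function name boltzmann_policy is ours):

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Softmax (Boltzmann) distribution over actions given their Q-value estimates."""
    prefs = q_values / tau
    prefs = prefs - prefs.max()          # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Example: sample an exploratory action among 5 actions at temperature 50.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, 0.2, -0.3, 0.9])
a = rng.choice(len(q), p=boltzmann_policy(q, tau=50.0))
```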

4. APPROACH

In this section we describe an interactive approach to exploration in reinforcement learning using statistical properties relevant to the underlying learning algorithm. We outline properties of exploration policies, statistical measures useful for this purpose and how these measures can be used to solicit interaction that drives the agent's exploration.

4.1 Exploration Policies

For sequential decision-making problems, exploration policies are used to guide the agent to different parts of the search space so as to obtain good estimates of the value function while covering as much of the state-action space as possible. There are a number of factors affecting this exploration, such as the dynamics of the domain, the sparsity of rewards, the size of the state-action space and the problem horizon (steps to goal). In addition to these properties, a subtle but important aspect of exploration that is implicit in existing methods is that exploration policies are not strictly stationary. As the agent gathers more information about the world, the exploration policy changes to accommodate the learned model.

Traditional methods, as explained earlier, explore using a variety of methods ranging from a uniformly random policy to value-based heuristics. These methods are prone to redundant exploration and (in some cases) expensive sample requirements. The idea of inefficient exploration also applies to methods involving human interaction, as human data is often limited to specific regions of the search space while relying on the learning algorithm to generalize effectively. To account for the characteristics of exploration policies while overcoming the limitations of existing methods, we present a policy-based approach to exploration. We solicit demonstrations based on the agent's uncertainty about its model to guide the agent to cover the search space more efficiently. For a given MDP, model uncertainty arises from a combination of stochastic elements in the domain (transition and reward function) and insufficiently explored states and actions. Keeping this in mind, we investigate properties of relevant RL algorithms and select measures that serve to characterize the model uncertainty with the goal of designing effective exploration policies.

In our work we use Q-learning with linear function approximation, which uses gradient methods to perform optimization. The loss function used is akin to minimizing the squared loss of the Bellman error [15]. Using this information, we use statistical properties of analogous methods that have been well studied, like linear regression or least squares, to understand the impact of each data point or observation on the learning agent's model.

4.2 Statistical Measures

In order to understand the agent's model uncertainty, we review the statistical analysis of linear regression methods [27] to find measures useful for our purposes. In linear regression problems, input observations can be scored by their influence to measure the effect they have on the learned model. A high influence score points towards observations that merit further investigation. Influence is computed as a combination of two measures: Leverage and Discrepancy. Leverage is a measure of how far a specific observation is from the convex hull of known observations; it helps recognize outliers. Discrepancy is related to how much an observation contributes towards model error. From the perspective of exploration in RL, these measures help to identify parts of the state-action space that have not been explored (using Leverage) and observations that lead to high model error (using Discrepancy). A key insight into using these types of measures for RL is that the data the agent trains on does not contain any outliers, as every observation made by the agent, by interacting with the domain, is relevant to solving the MDP. This indicates that there is essentially no data to be discarded. We hypothesize that an RL agent that actively explores observations with high leverage and high discrepancy, i.e., overall high influence, will explore more efficiently and solve the MDP faster.

In order to utilize these statistical measures, we explicitly set up the problem as a linear system in the standard form Xβ = y. An RL agent solving the MDP optimizes the function Q(s, a) = R(s, a) + γ max_{a'} Q(s', a'), where Q(s, a) = φ(s, a)ᵀθ. It is straightforward to see that the left-hand side of the optimization function, Q(s, a) or φ(s, a)ᵀθ, takes the place of Xβ and the right-hand side forms y. The input data for our approach comes from transitions of the RL agent as it attempts to solve the problem. The state-action features φ(s, a) observed by the agent during transitions are used to populate the rows of the data matrix X (n × k for n observations and k features). Given this formulation, we define the statistical measures useful for exploration.

4.2.1 Leverage

Leverage is a measure useful to determine how well the state-action space has been covered, as it detects outliers in the data. Given the matrix of independent variables X, we compute the leverage h using the hat matrix H as follows [27]:

H = X(XᵀX)⁻¹Xᵀ    (3)

The hat matrix maps the vector of dependent variables y to the vector of fitted values, ŷ = Hy. The diagonal elements of the hat matrix, hii, are the leverages, which describe the influence each dependent variable value has on the fitted value for observation i. Leverage values lie in the range [0, 1]. A high value indicates that the observation is an outlier and vice versa. A fixed threshold parameter is used to detect the presence of outliers; we use 0.5 as the cut-off. For RL problems, a high leverage indicates that the respective state-action pair is an outlier, i.e., it is novel and has not been visited often.

Typically, RL algorithms require large amounts of data to solve the MDP, which makes it infeasible to store all the transitions in a batch. In addition, computing leverage can pose computational issues as it requires taking the inverse of a matrix of size k × k (for k features), which can be very large for high-dimensional problems. To address the memory and computational concerns, we use the Sherman-Morrison formula to incrementally compute the inverse of XᵀX. For a new feature row x, the formula is stated as follows:

(A + xᵀx)⁻¹ = A⁻¹ − (A⁻¹xᵀxA⁻¹) / (1 + xA⁻¹xᵀ)    (4)

where A⁻¹ is initially set to (1/δ)I (I being the identity matrix of size k × k) and δ is a small positive number, say 1e−4. Using this computation, the leverage of a row x of the data matrix X can be computed as xA⁻¹xᵀ.
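A minimal sketch of this incremental computation, assuming NumPy and 1-D feature rows φ(s, a); the class name IncrementalLeverage is ours:

```python
import numpy as np

class IncrementalLeverage:
    """Track (X^T X)^{-1} with the Sherman-Morrison update (Equation 4) so that the
    leverage of a feature row x can be computed as x A^{-1} x^T without storing X."""

    def __init__(self, n_features: int, delta: float = 1e-4):
        self.A_inv = np.eye(n_features) / delta   # A^{-1} initialised to (1/delta) I

    def leverage(self, x: np.ndarray) -> float:
        """Leverage of a single feature row x (e.g., phi(s, a))."""
        return float(x @ self.A_inv @ x)

    def update(self, x: np.ndarray) -> None:
        """Rank-one Sherman-Morrison update after observing row x."""
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
```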

4.2.2 Discrepancy

This measure captures the observations that the learning algorithm is unable to model, thus leading to large errors. Discrepancy is computed using the externally studentized residual [27]. These residuals are obtained by computing the residual for an observation and dividing it by the standard error (or standard deviation). This is done to reduce the effect of the variance in the errors and allow residuals to be compared. An externally studentized residual computes the residual by taking into account the difference in the learned model with and without the observation in question. For observation i, the discrepancy ti can be computed as:

ti = ei / √( MSE(i) (1 − hii) )
MSE(i) = [ (n − p) MSE − ei² / (1 − hii) ] / (n − p − 1)    (5)

Here MSE is the mean squared error, n the number of samples, p the number of independent variables, hii the leverage for observation i and ei the TD error for sample i. MSE(i) is the mean squared error for the model based on all observations excluding sample i. We note that MSE is typically computed using batch data, which is computationally infeasible to store in large-scale RL domains. It also does not lend itself to incremental computation due to the max operator in the RL optimization function (when computing MSE and ei). We circumvent this problem by storing a batch data matrix, updating it with new observations using a (FIFO) sliding window, and computing the required parameters [12] online. In the analysis of linear systems, when the absolute value of the externally studentized residual |ti| is greater than 2, the corresponding observation is considered an outlier that needs further investigation. Henceforth we use the term discrepancy to denote the externally studentized residual.
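The following is a small sketch of that computation under our reading of Equation 5, assuming a FIFO window of recent TD errors (the function name and window size are ours):

```python
from collections import deque
import numpy as np

def discrepancy(e_i: float, h_ii: float, errors: deque, p: int) -> float:
    """Externally studentized residual (Equation 5) from a sliding window of TD errors.
    `errors` holds the n most recent TD errors e_j; p is the number of features.
    The window should be large relative to p so that n - p - 1 > 0."""
    errs = np.asarray(errors)
    n = len(errs)
    mse = float(np.mean(errs ** 2))
    mse_i = ((n - p) * mse - e_i ** 2 / (1.0 - h_ii)) / (n - p - 1)
    return e_i / np.sqrt(mse_i * (1.0 - h_ii))

# Example: keep the 500 most recent TD errors in a FIFO window.
window = deque(maxlen=500)
```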

Using these measures, the RL agent can identify observations in the MDP that require further exploration. We now describe how we use them to solicit demonstrations for exploration.

4.3 Demonstration Query

For every transition made by the RL agent, it computes the leverage and discrepancy and compares them to the respective thresholds. If either threshold is exceeded, we identify the corresponding observation as influential. To learn more about that observation and reduce its influence, we learn a policy, using guidance from a person or a simulated oracle, that drives the agent towards these observations.

Consider the state associated with an influential observation the agent transitioned into as s+. To encourage exploration towards s+, it is important to bridge the gap between regions of low influence and regions of high influence. Intuitively, state-action pairs that have low influence are likely to have been frequently visited and sufficiently explored. Therefore, designing a policy that leads from parts of the state-action space the agent knows well and visits often to those with high influence is most likely to encourage exploration to and around s+. As explained before, leverage provides a way to identify data points that have been visited often. To acquire the low influence data point that is to be connected to s+, we review every observation i in the current episode and compute the corresponding leverage, hi = φ(si, ai) A⁻¹ φ(si, ai)ᵀ. We then compute the mean leverage µh from these observations and find the state si that corresponds to the data point with the closest leverage,

argmin_i |hi − µh|    (6)

Once the low and high influence observations have been identified, we collect exploratory demonstrations either from a person or from a simulated oracle using a Graphical User Interface (GUI) for the domain. When the algorithm queries for demonstrations, the GUI highlights the states that need to be connected by demonstrations. Using the GUI, the user can a) provide demonstration(s), b) choose to ignore the query and c) stop interacting with the algorithm altogether. The simulated oracle provides demonstrations by following the shortest path between the queried states. There are no inherent assumptions made about the quality or quantity of demonstrations. The only requirement is for the user to be knowledgeable about the MDP dynamics so as to help the agent navigate in the domain. For every demonstration provided, we learn an oracle exploration policy πO using standard supervised learning algorithms and sample an action from this policy when the agent decides to explore.
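A minimal sketch of selecting the query's start observation via Equation 6, assuming a list of the episode's feature rows and the incrementally maintained A⁻¹ from the leverage sketch above (the function name is ours):

```python
import numpy as np

def query_start_index(episode_features, A_inv) -> int:
    """Return the index of the episode observation whose leverage is closest to the
    episode's mean leverage (Equation 6); its state is the start of the query."""
    leverages = np.array([x @ A_inv @ x for x in episode_features])  # h_i = phi A^{-1} phi^T
    return int(np.argmin(np.abs(leverages - leverages.mean())))
```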

We note here that when soliciting demonstrations, the final state in the demonstration may be different from the query state requested by the agent. This is likely to be observed in domains with stochastic elements and/or non-playable characters. While there is no straightforward way to ensure a certain state is visited in an MDP, our experimental results show that the policy learned using our approach is effective at driving the agent towards influential parts of the MDP as it continues to actively request user demonstrations.

4.4 Action Selection

The oracle exploration policy πO defines a policy that, when followed, is likely to guide the agent from regions of low influence to those of high influence. However, using such a policy alone to explore can be insufficient for the purposes of RL, where the goal is to arrive at the optimal policy as soon as possible. Additionally, the nature of exploration is non-stationary, and if there are limited demonstrations, the agent is less likely to explore the full set of influential regions in the MDP. To account for these properties, we design our exploration policy as follows:

πE ∝ (πO + πL) · πB    (7)

where πO is the oracle exploration policy, πL is the vector of leverage values for all a ∈ A in state s, and πB is the Boltzmann exploration policy.


Algorithm 1 Exploration from Demonstration (EfD)

repeat (for each episode):
    Initialize s
    repeat (for each step of the episode):
        Compute πE (Eqn. 7)
        Choose a (ε-greedy action selection using πE)
        Take action a, observe r, s'
        Store transition ⟨s, a, r, s'⟩
        Update θ and A⁻¹ (Eqn. 2 & 4)
        Compute leverage and discrepancy (Eqn. 3 & 5)
        if high influence at s then
            s+ ← s
            Compute starting state si (Eqn. 6)
            Query demonstrations from si to s+
            Update θ and A⁻¹
            Self-play for Ts steps (includes parameter updates)
        end if
        s ← s'
    until s is terminal
    Decay ε
until end of learning

We note that the leverage values lie in the range [0, 1] and for our purposes can be used as probabilities; a leverage value closer to 1 has the effect of sampling the corresponding action more often. Intuitively, πE is the exploration policy that chooses between the oracle demonstration and the leverage values, weighted by the softmax Q-values of the actions in the state. This exploration policy allows the agent to reach regions of high influence using human demonstrations, or to select actions with high leverage, while actively seeking the goal. We use πE in an ε-greedy fashion to facilitate exploration in our approach.
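A small sketch of Equation 7's action distribution, assuming per-action oracle probabilities, per-action leverage values in [0, 1] and current Q estimates as NumPy vectors (the function name and temperature default are ours):

```python
import numpy as np

def efd_exploration_policy(pi_oracle, pi_leverage, q_values, tau=50.0):
    """Exploration distribution pi_E ∝ (pi_O + pi_L) · pi_B (Equation 7)."""
    prefs = q_values / tau
    pi_b = np.exp(prefs - prefs.max())       # Boltzmann weights
    pi_b /= pi_b.sum()
    unnorm = (pi_oracle + pi_leverage) * pi_b
    return unnorm / unnorm.sum()
```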

4.5 Exploration from Demonstration

We now outline our approach, with all the pieces defined, in Algorithm Block 1. The objective of Exploration from Demonstration (EfD) is to learn the optimal policy using RL while ensuring the agent actively explores regions of the state-action space that have a potentially large influence on the learned model.

EfD as described in Algorithm Block 1 has a tendency to query the user for demonstrations repeatedly, as high influence regions are often in close proximity to one another. This results in the leverage and discrepancy thresholds being crossed very often within the same episode. To make EfD more user-friendly, we include a predefined fixed time period Ts during which the agent executes self-play without any user queries. During self-play, the Q-learning and leverage parameters (θ and A⁻¹) continue to be updated. We note that EfD does not modify any theoretical guarantees of the methods used, as Q-learning is an off-policy algorithm. This completes the description of our approach.

5. EXPERIMENTAL SETUP

To validate the performance of EfD we conduct experiments on a gridworld and a popular arcade game domain and compare our method to several baselines. In this section we describe the domains used in our experiments and the relevant baselines.

Figure 1: A snapshot of the two domains used in our experiments. (a) Gridworld, with slippery and isolated regions. (b) Frogger1x.

5.1 Domains

We use two domains to empirically highlight the performance and properties of EfD. We represent these domains as MDPs in the following manner:

Gridworld. This domain is designed by adapting the specifications outlined in [14]. We implement an 18×18 discrete grid (Figure 1a) with four deterministic actions that move the agent up, down, left and right. The goal is to reach the top right corner of the grid. For every step taken, the agent accrues a step cost of −1, and it receives a reward of 0 at the goal state. The blue shaded regions represent slippery squares. If the agent transitions out of a slippery square (either into an unshaded square or into another slippery square), the reward is uniformly distributed in the interval [−12, +10]. The gray shaded regions represent isolated squares. Any transition that leads the agent into this region from an empty square has a 0.1 probability of success. Once inside, movements within the isolated region as well as those leading out are not restricted. Transitions leading into the bounding walls keep the agent's state unchanged. An episode starts with the agent randomly placed in the grid and stops when the agent reaches the goal. We use identity features to represent the state space.

Frogger. In the game of Frogger (Figure 1b), the agent's goal is to navigate from the bottom to the top of the grid while avoiding the cars and water pits (shown as dark squares in the top row). The cars move in straight lines in their individual rows, going either left or right. The direction is randomly chosen at the start of an episode and stays fixed until the episode ends. The intermediate gray row(s) represent safe zones for the agent, i.e., no cars and no water pits. The agent has 5 actions: 4 directional actions and a no-op action. Within an episode, the transitions are deterministic. The reward function is +1000 for reaching the goal, −100 for dying (directly hitting a car, crossing over a car, falling into water) and 0 everywhere else. An episode starts with the agent in a random position in the bottom row and stops when the agent dies or reaches the goal. The state space of the domain consists of the agent's position along with the position and direction of travel of the cars in the grid, and is represented using binary features. The Frogger domain is particularly interesting as it lends itself to easy scalability: the domain can be made more complex, i.e., given a longer solution horizon, by increasing the number of intermediate rows between the start and goal positions. We use this property to show how our method scales with the size of the domain. We refer to the domain configuration in Figure 1b as Frogger1x, and use Frogger2x and Frogger4x to indicate versions with double and quadruple the number of rows used in Frogger1x, respectively.

5.2 Baselines

We implement five baselines in our experiments and compare their performance to EfD. Uniform random exploration and softmax exploration comprise two of the baselines. The remaining three are defined as follows:

Learning from Demonstration + RL. In this method we acquire demonstrations of optimal behavior from people or a simulated oracle. These demonstrations are used to learn a policy using supervised learning methods (in this case logistic regression). We use this policy as the seed policy to initiate RL. We execute the algorithm on the domain and report the results.

Exploration by TD error. This approach draws on the insights of Gehring and Precup [14] and learns an exploration policy based on TD error. In our implementation we use the current estimate of the TD error (its absolute value) as the reward for a given state-action pair and learn a Q-function using this information. A policy is extracted from the Q-function using softmax action selection. The Q-function learned in this process plays the role of driving the agent towards parts of the state-action space that have high TD error in order to gather more information in those regions. We note that this approach is not strictly consistent with standard MDP assumptions, as the reward function is non-stationary (the TD error is constantly changing). While we counteract this effect to a certain degree by using a small learning rate and a decaying exploration parameter, our experiments show that performance was not greatly affected by this characteristic.

Exploration by Leverage. We derive an exploration policy by computing the leverage on data consisting of visited state-action pairs. For any given state, we compute the leverage for all actions, normalize the results and use them as a distribution from which we sample exploratory actions. As explained earlier, leverage captures outliers in the data, which in this case represent actions, for a given state, that have not been tried often. In this way, by sampling from the normalized leverage values for all actions in a state, the agent is more likely to sample new actions. This baseline is useful to signify the importance of exploratory demonstrations.
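A minimal sketch of this baseline's per-state action distribution, assuming the duplicated-feature encoding and the incrementally maintained A⁻¹ from the earlier sketches (the function name is ours):

```python
import numpy as np

def leverage_action_probs(state_features: np.ndarray, A_inv: np.ndarray, n_actions: int) -> np.ndarray:
    """Normalize the per-action leverages of a state into a sampling distribution."""
    k = state_features.shape[0]
    levs = np.empty(n_actions)
    for a in range(n_actions):
        x = np.zeros(k * n_actions)
        x[a * k:(a + 1) * k] = state_features   # phi(s, a)
        levs[a] = x @ A_inv @ x
    return levs / levs.sum()
```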

6. EXPERIMENTS AND RESULTS

We implement EfD for the chosen domains and highlight the results achieved, along with several tests that provide insight into EfD's performance under different experimental conditions.

6.1 Using Leverage and Discrepancy for Exploration

In this experiment we use the gridworld (Figure 1a) to show the utility of leverage and discrepancy as measures to guide exploration. The gridworld is suitable for this purpose due to several design choices. First, the isolated (gray) regions in the grid represent parts of the state space that are hard to reach and thus unlikely to be explored by the agent. Second, the non-deterministic reward function (represented by the slippery blue patches on the grid) poses some difficulty to the learning algorithm in accurately modeling the underlying value function. We conduct two experiments to show the individual contribution of each exploration measure.

Figure 2: Leverage and discrepancy heat maps generated from random exploration of the gridworld domain. (a) Leverage heat map. (b) Discrepancy heat map.

To highlight the utility of leverage, we perform a random walk in the gridworld for 50 episodes. In each episode, the agent starts in a random position and moves randomly until the goal is reached. For every step taken, the features of the visited state-action pairs are used to form the data matrix X. Using X, we compute the hat matrix H (see Equation 3) and use that to derive the leverage h(s, a) for every state-action pair.

In Figure 2a, we plot a heat map of max_a h(s, a) for all s ∈ S. The heat map shows the correspondence between the regions of high leverage (h ≥ 0.5) and the hard-to-explore regions of the gridworld (the isolated gray regions). Leverage, as explained earlier, is used to identify outliers in the data. In this case the heat map identifies the gray region as outliers, which calls for further exploration. Analogously, this effect can be seen in other, more complex high-dimensional domains where it is hard to entirely cover the state-action space. Leverage, as shown here, captures regions that the learning agent is unable to reach often.
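For reference, a small sketch of how such a heat map can be computed from the random-walk data matrix, assuming a tabular mapping from rows of X back to (state, action) pairs (all names here are ours):

```python
import numpy as np

def leverage_heatmap(X: np.ndarray, row_state_action, n_states: int) -> np.ndarray:
    """Compute max_a h(s, a) per state, where h is the diagonal of the hat matrix
    H = X (X^T X)^{-1} X^T (Equation 3). row_state_action[i] = (s, a) for row i of X."""
    h_diag = np.einsum('ij,jk,ik->i', X, np.linalg.pinv(X.T @ X), X)
    heat = np.zeros(n_states)
    for h, (s, _a) in zip(h_diag, row_state_action):
        heat[s] = max(heat[s], h)
    return heat
```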

While leverage captures how often state-action pairs have been visited, it does not capture details about the transition function, the reward function or the RL algorithm's learned model. Here we show how discrepancy is useful for this task. We perform Q-learning with linear function approximation on the gridworld domain for 50 episodes. We set γ = 0.99, α = 0.1 and ε = 1.0. We store all the transitions ⟨s, a, r, s'⟩ from the sampled episodes and compute the mean squared error using the learned weight vector θ (see Equation 5). The mean squared error is used to derive the discrepancy t(s, a) for every state-action pair.

In Figure 2b, we plot a heat map of max_a |t(s, a)| for all s ∈ S. The heat map shows the correspondence between the regions of high discrepancy (|t| > 2) and the slippery regions in the gridworld. Intuitively this is to be expected, as the TD error in these areas is likely to have large magnitude and high variance, which warrants further investigation. We note that the bright red patch in the top right corner of the heat map signifies the high residual obtained at the goal and its adjoining states. Using this heat map as a threshold for exploration would draw the agent towards the slippery patches and the goal until the learning algorithm captures the underlying model and the residual decreases.


Figure 3: Performance of EfD and baselines on the Frogger game domain of varying sizes (1x, 2x and 4x), averaged over 10 trials. The numbers in parentheses are the average number of input user demonstrations.

This experiment serves to highlight the roles played by leverage and discrepancy and how they guide the agent's exploration. Leverage guides the agent towards state-action pairs that have not been visited often during learning, and discrepancy guides the agent towards regions of the domain where the learning algorithm has difficulty modeling the value function.

6.2 EfD for Frogger

In this experiment we instantiate the EfD algorithm in the Frogger domain for different sizes of the problem: Frogger1x (10 rows), Frogger2x (18 rows) and Frogger4x (34 rows). We perform Q-learning with linear function approximation with γ = 0.99, α = 0.0006 and εstart = 0.8. We use the following decay schedule for the exploration parameter: ε = εstart × N0 / (N0 + Ep#), where N0 is the decay rate and Ep# is the current episode number. We set N0 to 1500, 2500 and 5000 respectively for the three versions of Frogger. The threshold parameters for EfD were fixed, with the leverage threshold set at 0.5 and the discrepancy threshold at 2. The self-play time period for EfD was set to Ts = 500, Ts = 2000 and Ts = 5000 steps respectively for the three versions of Frogger. The temperature for softmax Boltzmann exploration was set to 50. We acquired demonstrations from two users who were both familiar with the dynamics of the game. We received anywhere from 5 to 30 demonstrations depending on the size of the domain being tested; demonstration time in total was no more than 10 to 15 minutes. The human policy was learned using logistic regression with a learning rate of 0.01. We plot the results of this experiment along with comparative baselines in Figure 3. We see that EfD converges to the optimal policy faster than the baselines using a small number of demonstrations. To ease readability, we plot only a subset of the baseline methods in Figure 3, as the performance of the baselines (TD-error, Leverage and Softmax) was consistent across the three sizes. The improvement over the baselines grows as the size of the domain is increased. This is explained by how EfD queries and utilizes user demonstrations. In the initial stages of learning, the user is queried for demonstrations leading to states in the rows closer to the bottom row. As the agent gains experience, demonstrations are requested for states further up. In this process, the agent incrementally explores the rows until it finally reaches the goal in the top row.

Figure 4: Examples of the types of states queried by the agent during EfD when applied to Frogger2x (18 rows). (a) The agent queries the user for demonstrations to the highlighted position based on the leverage threshold. (b) The agent queries the user for demonstrations to the highlighted position based on the discrepancy threshold.

Such an incremental learning approach makes it easier for the agent to reach the goal as well as easier for the user to provide demonstrations. This also explains why LfD (+ RL) does not perform as well as EfD on larger grids: the size of the domain limits the search space covered by the human demonstrations and makes optimal demonstrations from start to goal impractical. EfD performs better by using the RL algorithm's inductive bias as well as the underlying representation to acquire incremental demonstrations that are most useful to the agent. These results were consistent across both users. The TD-error and leverage baseline methods, while more informed, do not perform as well due to their redundant exploration. An interesting observation from our experiments is that, by using thresholds for the statistical measures, the agent automatically ceases to request demonstrations once it has sufficient experience; at that point, we observe that the agent has enough information to model the Q-function and solve the MDP.

6.3 Types of States Queried

Here we take a closer look at the types of states queried by the agent during EfD. We present results from the Frogger2x domain, which consists of 18 rows from start to goal. Figure 4a is an example of an agent query, based on the leverage measure, where a demonstration is requested between the frog's position near row 8 and the highlighted grid position near row 4. An observation that exceeds the leverage threshold indicates that the agent has not visited that state-action pair often and therefore requires input demonstrations. Such a query points towards how EfD gathers information about the state-action space, decomposing the domain into smaller regions. Demonstrations are requested from known regions to unknown regions, and often the two are in close proximity to each other. Providing a demonstration for such a query is easier than providing optimal demonstrations from start to end in this domain. Figure 4b is an example of an agent query based on the discrepancy threshold. We note that the highlighted position is around a car near row 6, which is a terminal state with reward −100. The discrepancy here exceeded the threshold because the agent's current model was unable to make an accurate prediction of the Q-value of an action in this state, and the agent thus requested a demonstration.

We would also like to highlight a few uncommon queries that provide interesting insights into the method. In some cases the agent requests demonstrations from a state where the frog is closer to the goal to states where the frog is further away. From the perspective of solving the MDP, such a policy would encode suboptimal information; however, for policy-based exploration, it only serves to get a better estimate of the Q-function. Note that exploration is carried out by combining the human policy with the softmax policy (Section 4.4), which ensures the agent selects actions that are more likely to lead it towards the goal. For some queries we observe that the agent's position on the grid remains the same for both the start and final states, while the positions of the cars are different. With respect to EfD, these are different states and therefore it is a valid demonstration query. This shows how EfD makes demonstration queries by taking into account the underlying representation.

6.4 Effect of Input Demonstrations and Threshold Parameters

In this experiment, we test the sensitivity of EfD to the quality of the demonstrations used to learn the exploration policy. While we do not place any assumptions on the quality of demonstrations, we analyze the degree to which performance is affected as the quality of the demonstrations is varied. We use the simulated oracle for this experiment under different demonstration noise conditions: Oracle0.1, Oracle0.3 and Oracle0.5. An oracle with noise 0.1 (Oracle0.1) provides the required demonstration 90% of the time and takes random actions 10% of the time. The results of this experiment are summarized in Table 1.

            Frogger1x (5)   Frogger2x (12)   Frogger4x (23)
Oracle0.0   2560 ± 150      3194 ± 230       4430 ± 410
Oracle0.1   2752 ± 321      3470 ± 527       4893 ± 564
Oracle0.3   2648 ± 469      3304 ± 699       5218 ± 866
Oracle0.5   5102 ± 932      6329 ± 875       7688 ± 1043
ε-greedy    6570 ± 120      8555 ± 212       10000 ± 405

Table 1: EfD performance as a function of the quality of input demonstrations from a simulated oracle in the Frogger domain. The values are the number of episodes taken by each method to converge to the optimal policy. The numbers in parentheses are the number of input demonstrations the simulated oracles are limited to. We include results from the ε-greedy baseline for comparison. The results are averaged over 10 trials.

As evidenced by the table, the performance of EfD varies with the quality of the input demonstrations. Relative to Oracle0.0 (optimal oracle demonstrations), Oracle0.1 and Oracle0.3 achieve similar performance. This is explained by the fact that for most query demonstrations there exist multiple ways to reach the desired state, and no path is more optimal than another from the perspective of exploration. By introducing noise into the oracle demonstrations, the noisy oracles can potentially explore more states than Oracle0.0, including both low and high influence observations; however, this increases the variance in performance. Oracle0.5 (with 50% random action selection) has large variance in its performance but still outperforms ε-greedy uniform random exploration.

In our experiments, we set the leverage threshold at 0.5 and the discrepancy threshold at 2. Changes to these parameters directly affect the number of demonstrations queried by the agent, which in turn affects the amount of exploration carried out by the agent. Higher values result in fewer demonstration queries, and as a result most of the exploration is carried out by the agent autonomously. On the other hand, lower thresholds result in frequent queries, which has the effect of learning an exploration policy close to a uniform policy. In general, from tests in our domains, we find that setting the leverage threshold to 0.5 and the discrepancy threshold anywhere in the range [2, 6] provides the best results.

7. DISCUSSION AND CONCLUSION

Here we highlight the benefits of using a policy-based approach for exploration in RL in the context of EfD. When faced with large domains with sparse rewards and long horizons, a policy-based approach is less vulnerable to the large sample requirements of value-based methods, as the information acquired from a single demonstration allows the agent to extend its range of exploration over multiple timesteps. Additionally, such a method does not concern itself solely with reward information. The statistical measures used in EfD (leverage and discrepancy) focus on different aspects of the MDP, which allows the algorithm to function well across a wider class of problems. This is in contrast to value-based methods, which rely on large samples of reward information to estimate the uncertainty in the value function, something that is often complicated in sparse reward and long horizon domains. Also, EfD does not require optimal demonstrations to learn, but instead demonstrations that serve to connect two regions of the agent's choice. As these demonstrations are used for exploration, they can be potentially noisy (which may in some cases help the agent).

In this paper we presented a model-free policy-based approach called Exploration from Demonstration (EfD) that performs interactive exploration for RL algorithms. Our method adapts statistical measures of linear regression to capture aspects of an MDP that are important to explore and model in order to learn the optimal Q-function. We highlight the properties of these measures in an instructional gridworld MDP and empirically test our approach on a popular arcade game under different experimental conditions. We show how EfD scales to larger problems and outperforms baselines using only exploratory demonstrations, while placing very few requirements on the quality and quantity of input data. Our method is particularly suited to problems that have a long horizon and sparse rewards, as well as domains where optimal demonstrations are hard to acquire. In the future we would like to extend EfD to learn a model of the MDP, thus allowing the algorithm to request examples from arbitrary states rather than waiting to transition to those areas. Another interesting avenue for future work is extending EfD to work with more data-efficient RL algorithms like Least Squares Policy Iteration [25].

Acknowledgments

We thank the reviewers for their helpful comments in improving the paper. This work is supported by ONR grant No. N000141410003.


REFERENCES

[1] P. Abbeel, A. Coates, and A. Y. Ng. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research, 29(13):1608–1639, 2010.
[2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. of the 21st ICML, 2004.
[3] T. Akiyama, H. Hachiya, and M. Sugiyama. Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks, 23(5):639–648, 2010.
[4] B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[5] C. Atkeson and S. Schaal. Learning tasks from a single demonstration. In Proc. of the IEEE ICRA, pages 1706–1712, 1997.
[6] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, Mar. 2003.
[7] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In ALT, volume 5809 of Lecture Notes in Computer Science, pages 23–37. Springer, 2009.
[8] M. Cakmak and M. Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI Conference on Artificial Intelligence, 2012.
[9] O. Simsek and A. G. Barto. An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 833–840. ACM, 2006.
[10] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters. Active reward learning. In Proceedings of Robotics: Science & Systems (R:SS), 2014.
[11] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In Proc. of the 15th AAAI, pages 761–768, 1998.
[12] T. G. Dietterich. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition, pages 15–30. Springer-Verlag, 2002.
[13] A. Epshteyn, A. Vogel, and G. DeJong. Active reinforcement learning. In ICML, volume 307 of ACM International Conference Proceeding Series, pages 296–303. ACM, 2008.
[14] C. Gehring and D. Precup. Smart exploration in reinforcement learning using absolute temporal difference errors. In International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS, pages 1037–1044, 2013.
[15] M. Geist and O. Pietquin. Algorithmic survey of parametric value function approximation. IEEE Trans. Neural Netw. Learning Syst., 24(6):845–867, 2013.
[16] I. Giannoccaro and P. Pontrandolfo. Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002.
[17] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In Neural Information Processing Systems 26, pages 2625–2633, 2013.
[18] T. Hester, M. Lopes, and P. Stone. Learning exploration strategies in model-based reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, pages 1069–1076, 2013.
[19] T. Hester and P. Stone. TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Machine Learning, 90(3), 2013.
[20] K. Judah, A. P. Fern, T. G. Dietterich, and P. Tadepalli. Active imitation learning: Formal and practical reductions to i.i.d. learning. Journal of Machine Learning Research, 15:4105–4143, 2014.
[21] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. JAIR, 4:237–285, 1996.
[22] W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proc. of the 9th Intl. Conf. on AAMAS, pages 5–12, 2010.
[23] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In Proc. of the 11th Intl. Conf. on AAMAS, pages 475–482, 2012.
[24] L. Kocsis and C. Szepesvari. Bandit based Monte-Carlo planning. In ECML, volume 4212 of Lecture Notes in Computer Science, pages 282–293. Springer, 2006.
[25] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
[27] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis (4th ed.). Wiley & Sons, Hoboken, July 2006.
[28] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc. of the 17th ICML, 2000.
[29] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
[30] B. Price and C. Boutilier. Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research (JAIR), 19:569–629, 2003.
[31] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
[32] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 15 of JMLR Proceedings, pages 627–635. JMLR.org, 2011.
[33] W. D. Smart and L. P. Kaelbling. Effective reinforcement learning for mobile robots. In ICRA, 2002.
[34] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[35] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[36] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.
[37] C. Szepesvari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[38] M. Taylor, H. B. Suay, and S. Chernova. Integrating reinforcement learning with human demonstrations of varying ability. In Proc. of the Intl. Conf. on AAMAS, pages 617–624, 2011.
[39] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, Mar. 1995.