
CST: Constructing Skill Trees by Demonstration

George Konidaris [email protected]

MIT CSAIL, 32 Vassar Street, Cambridge MA 02139 USA

Scott Kuindersma [email protected]
Roderic Grupen [email protected]
Andrew Barto [email protected]

Computer Science Department, University of Massachusetts Amherst, Amherst MA 01003 USA

Abstract

We describe recent work on CST, an online algorithm for constructing skill trees from demonstration trajectories. CST segments a demonstration trajectory into a chain of component skills, where each skill has a goal and is assigned a suitable abstraction from an abstraction library. These properties permit skills to be improved efficiently using a policy learning algorithm. Chains from multiple demonstration trajectories are merged into a skill tree. We describe applications of CST to acquiring skills from human demonstration in a dynamic continuous domain and from both expert demonstration and learned control sequences on a mobile manipulator.

1. Introduction

Learning from demonstration (or LfD) (Argall et al., 2009) offers a natural and intuitive approach to robot programming: rather than investing effort into writing a detailed control program, we simply show the robot how to achieve a task. LfD has received a great deal of attention in recent years because it aims to facilitate ubiquitous general-purpose automation by removing the need for engineering expertise and instead enabling the direct use of existing human procedural knowledge.

This paper summarizes recent work on CST, an LfD algorithm with four properties which, taken together, distinguish it from previous work. First, rather than converting a demonstration trajectory into a single controller, CST segments demonstration trajectories into a sequence of controllers (which we term skills, but are also called behaviors or motion primitives). This aims to extract reusable components of the demonstrator's behavior. Second, CST extracts skills which have goals: in particular, the objective of skill n is to reach a configuration where skill n+1 can be successfully executed. Such skills can be refined by the robot using policy improvement algorithms. Third, CST optionally supports skill-specific abstraction selection, where each skill policy is defined using only a small number of relevant state and motor variables. This affords efficient representation and learning, facilitates transfer, and enables the acquisition of policies that are high-dimensional when represented monolithically but consist of subpolicies that can be individually represented using far fewer state variables. Finally, CST merges skill chains from multiple demonstrations into a skill tree, allowing it to deal with collections of trajectories that use different component skills to achieve the same goal, while also determining which trajectory segments are instances of the same policy.

Appearing in Proceedings of the ICML Workshop on New Developments in Imitation Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

2. Background

This work adopts the options framework (a hierarchical reinforcement learning formalism for learning and planning using temporally extended actions, or options) for modeling acquired skills.

An option, o, consists of three components: an option policy, π_o, giving the probability of executing each action in each state in which the option is defined; an initiation set indicator function, I_o, which is 1 for states where the option can be executed and 0 elsewhere; and a termination condition, β_o, giving the probability of option execution terminating in states where the option is defined. Given an option reward function (often just a cost function with a termination reward), determining the option's policy can be viewed as just another reinforcement learning problem, and an appropriate policy learning algorithm can be applied. Once acquired, a new option can be added to an agent's action repertoire alongside its primitive actions, and the agent chooses when to execute it in the same way.

An option is a useful model for a robot controller: it contains all the information required to determine when a controller can be run (its initiation set), when it is done (its termination condition), and how it performs control (its policy). An option reward function allows us to model controllers that can be improved through experience. In the remainder of this paper, we will use the terms skill and option interchangeably.
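As a concrete, purely illustrative sketch, an option can be represented as a small container holding its three components; the class and field names below are ours and are not taken from any implementation described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = Sequence[float]   # a robot state vector (illustrative type alias)
Action = int              # a primitive action index (illustrative)

@dataclass
class Option:
    """A skill modeled as an option: policy, initiation set, and termination condition."""
    policy: Callable[[State], Action]      # pi_o: maps a state to an action
    initiation: Callable[[State], bool]    # I_o: True where the option may be executed
    termination: Callable[[State], float]  # beta_o: probability of terminating in a state

    def can_start(self, state: State) -> bool:
        return self.initiation(state)
```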

CST performs segmentation based on a linear value function approximation. Each option value function, V, is thus approximated by a weighted sum of a given set of basis functions, φ_1, ..., φ_n: V(x) = ∑_{i=1}^{n} w_i φ_i(x). We use the Fourier basis (Konidaris et al., 2011b) throughout this work.
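For illustration only, a linear value function over a Fourier basis might be implemented as follows. This is a sketch, not the authors' code, and it assumes the state has been normalized to [0, 1]^d, with basis terms cos(π c_i · x) as in Konidaris et al. (2011b).

```python
import itertools
import numpy as np

def fourier_coefficients(order: int, dims: int) -> np.ndarray:
    """All integer coefficient vectors c with entries in {0, ..., order}."""
    return np.array(list(itertools.product(range(order + 1), repeat=dims)))

def features(x: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """phi_i(x) = cos(pi * c_i . x), for x normalized to [0, 1]^d."""
    return np.cos(np.pi * coeffs @ x)

def value(x: np.ndarray, w: np.ndarray, coeffs: np.ndarray) -> float:
    """V(x) = sum_i w_i * phi_i(x)."""
    return float(w @ features(x, coeffs))

# Example: a 3rd-order Fourier basis over a 4-dimensional state.
coeffs = fourier_coefficients(order=3, dims=4)   # (order+1)^d basis functions
w = np.zeros(len(coeffs))                        # weights to be learned
print(value(np.array([0.5, 0.5, 0.0, 0.0]), w, coeffs))
```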

Although value function methods are widely used in reinforcement learning, robotics applications typically use policy gradient algorithms, which represent the policy π directly. Nevertheless, value function approximation is a key step in many policy gradient algorithms, since an approximate value function can be used to obtain a low-variance estimator of the policy gradient when π is differentiable (Sutton et al., 2000). Even when it is not, an approximate value function is still a useful guide to the structure of the policy and often contains richer information for use in segmentation than the policy itself (which is often piecewise constant).

We may wish to define a skill policy in a smaller and more task-relevant state space than the full state space of the robot. This is known as an abstraction. In this work we define an abstraction M to be a pair of functions (σ_M, τ_M), where σ_M : S → S_M is a state abstraction mapping the overall state space S to a smaller state space S_M (often simply a subset of the variables in S, but potentially a more complex mapping involving significant feature processing), and τ_M : A → A_M is a motor abstraction mapping the full action space A to a smaller action space A_M (often simply a subset of A). When using an abstraction, the agent's sensor input is filtered through σ_M and its policy π maps from S_M to A_M. We assume that each abstraction has a set of basis functions, Φ_M, defined over S_M which we can use to define a value function. Therefore, using an abstraction amounts to representing the relevant value function using that abstraction's basis functions.
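The following sketch shows one way an abstraction (σ_M, τ_M) with its basis Φ_M could be represented, under the simplifying assumption that both maps select subsets of variables; the class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence
import numpy as np

@dataclass
class Abstraction:
    """An abstraction M = (sigma_M, tau_M) with its own basis functions Phi_M."""
    state_vars: Sequence[int]   # indices kept by sigma_M (assumes a simple variable subset)
    motor_vars: Sequence[int]   # indices kept by tau_M
    basis: object               # Phi_M, e.g. Fourier coefficients over the abstract state

    def sigma(self, s: np.ndarray) -> np.ndarray:
        """sigma_M: S -> S_M."""
        return s[list(self.state_vars)]

    def tau(self, a: np.ndarray) -> np.ndarray:
        """tau_M: A -> A_M."""
        return a[list(self.motor_vars)]
```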

3. CST

Given a demonstration trajectory, our task is to break it into component skills. A common principle in robotics, control, and reinforcement learning is that the goal of a skill is to reach another skill (Lozano-Perez et al., 1984; Burridge et al., 1999; Tedrake, 2009; Konidaris and Barto, 2009). Thus, we aim to slice the trajectory into a chain of contiguous skill segments; we would like to split a segment into two when its value function is too complex to represent as a single segment, or when it is composed of two segments best represented with different abstractions.

Statistical changepoint detection algorithms perform just such a segmentation task. Here, we are given observed data and a set of candidate models. We assume that the data are sequentially generated by an instance of a single model, occasionally switching between models at certain points in time, called changepoints. We are to infer the number and positions of the changepoints and select and fit an appropriate model for each segment. Figure 1 shows a simple example.


Figure 1. Artificial example data with multiple segments. The observed data (a) are generated by three different models plus noise (b; solid lines, changepoints shown using dashed lines). The first and third segments are generated by a linear model, whereas the second is quadratic.

Because our data are received sequentially and possibly at a high rate, we would like to perform changepoint detection online, processing transitions as they occur and then discarding them. Fearnhead and Liu (2007) introduced online algorithms for both Bayesian and maximum a posteriori (MAP) changepoint detection. We use the simpler MAP method.

Their model is as follows. A set, Q, of models is given with prior p(q) for each q ∈ Q. Data tuples (x_t, y_t) are observed for times t ∈ {1, 2, ..., T}. The marginal probability of a segment length l is modeled with probability mass function g(l) and cumulative distribution function G(l) = ∑_{i=1}^{l} g(i). Finally, a segment from time j + 1 to t can be fit using model q to obtain P(j, t, q), the probability of the data segment conditioned on q. This results in a Hidden Markov Model where the hidden state at time t is the model q_t and the observed data is y_t given x_t. This model is depicted in Figure 2. Notice that all of its transition probabilities are known or computed directly from the data. Rather than attempting to learn the transition probabilities of the hidden states, we are instead trying to compute the maximum likelihood sequence of hidden states given their transition probabilities and the data. We can therefore use an online Viterbi algorithm to compute the most likely transition path through the hidden states in this HMM given the data, and thereby our segmentation.

Figure 2. The Hidden Markov Model for changepoint detection. The model q_t at each time t is hidden, but produces observable data y_t. Transitions occur when the model changes, either to a new model or the same model with different parameters. The transition from model q_i to q_j occurs with probability g(j − i − 1)p(q_j), while the emission probability for observed data y_i, ..., y_{j−1} is P(i, j, q_i)(1 − G(j − i − 1)). These probabilities are considered for all times i < j and models q_i, q_j ∈ Q.

Unfortunately, the number of possible paths in a Viterbi algorithm grows over time. However, most changepoint probabilities will be close to zero. We can therefore employ a particle filter to discard most of them and retain a constant number per timestep. We use the Stratified Optimal Resampling algorithm of Fearnhead and Liu (2007) for this purpose.
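To make the segmentation objective concrete, the sketch below computes the same MAP changepoint segmentation exactly but offline, by dynamic programming over a completed trajectory; the paper instead uses Fearnhead and Liu's online particle-filter recursion, which also treats the still-open final segment using 1 − G. The functions log_prior, log_g, and log_fit are placeholders supplied by the caller and correspond to log p(q), log g(l), and log P(j, t, q).

```python
def map_segmentation(T, models, log_prior, log_g, log_fit):
    """Offline MAP changepoint segmentation by dynamic programming, O(T^2 |Q|).

    log_fit(j, t, q) is the log marginal probability of the data on (j, t] under model q.
    Returns a list of (j, t, q) triples, one per segment, in temporal order.
    """
    best = [0.0] + [float("-inf")] * T   # best[t]: log MAP score of segmenting times 1..t
    back = [None] * (T + 1)              # back[t]: (previous changepoint j, model q)
    for t in range(1, T + 1):
        for j in range(t):
            for q in models:
                score = best[j] + log_g(t - j) + log_prior(q) + log_fit(j, t, q)
                if score > best[t]:
                    best[t], back[t] = score, (j, q)
    # Recover changepoints and per-segment models by walking back from T.
    segments, t = [], T
    while t > 0:
        j, q = back[t]
        segments.append((j, t, q))
        t = j
    return list(reversed(segments))
```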

CST extracts skills by applying changepoint detection to the demonstrated trajectory data. It uses the sets of basis functions associated with each abstraction as models and the sample return, R_t = ∑_{i=t}^{n} γ^{i−t} r_i, from the state at each time t as the target variable. This effectively performs changepoint detection on the value function sample obtained from the trajectory. Segmentation thus breaks that value function sample up into simpler segments or detects a change in abstraction. This is depicted in Figure 3.

Figure 3. An illustration of a trajectory segmented into skills by CST. A robot executes a trajectory where it goes through a door, approaches and picks up a key, and then takes the key to a lock (bottom). The robot is equipped with three possible abstractions: state variables giving its distance to the doorway, the key, and the lock, respectively. The values of these variables change during trajectory execution (middle) as the distance to each object changes while it is in the robot's field of view. The robot also obtains a sample of return for each point along the trajectory (top). CST splits the trajectory into segments by finding a MAP segmentation such that the return estimate is best represented by a piecewise linear value function where each segment is defined over a single abstraction. Changepoints are indicated by dashed vertical lines.

We assume a geometric distribution for skill lengths with parameter p, which gives us a natural way to set p via k = 1/p, the expected skill length. Since we are using linear value function approximation, we use a linear regression model with Gaussian noise as our model of the data. Following Fearnhead and Liu (2007), we assume conjugate priors: the Gaussian noise prior has mean zero and an inverse gamma variance prior. Note that we are using each R_t as the target regression variable in this formulation, even though we only observe r_t for each state. However, the relevant sufficient statistics can be computed incrementally using r_t, and thus the fit probability can be computed online at each timestep without storing any transition data.
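As an illustration of the quantities involved, the sketch below computes the sample returns and the regression sufficient statistics for a candidate segment from stored data. The paper's point is that the same statistics can instead be maintained incrementally as each reward arrives, so no transitions need to be stored; the function names here are ours, and the closed-form segment evidence P(j, t, q) computed from these statistics (following Fearnhead and Liu, 2007) is not reproduced.

```python
import numpy as np

def sample_returns(rewards, gamma):
    """R_i = sum_{k >= i} gamma^(k - i) r_k, computed backwards over the whole trajectory."""
    R = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        R[i] = running
    return R

def segment_statistics(features, returns, j, t):
    """Sufficient statistics for fitting the value function of the segment (j, t].

    features: basis-function vectors phi(x_i) for every state in the trajectory
    returns:  sample returns R_i, used as the regression targets
    For clarity this slices stored data; online, the same sums are updated incrementally.
    """
    phis = np.asarray(features)[j + 1 : t + 1]
    R = np.asarray(returns)[j + 1 : t + 1]
    A = phis.T @ phis        # sum_i phi_i phi_i^T
    b = phis.T @ R           # sum_i R_i phi_i
    c = float(R @ R)         # sum_i R_i^2
    return A, b, c, len(R)   # enough to evaluate the segment evidence P(j, t, q)
```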

Given multiple skill chains obtained this way from different trajectories, we would like to merge them into a skill tree by determining which pairs of trajectory segments belong to the same skills and which are distinct.

Because we wish to build skills that can be sequentially executed, we only consider merging two segments when they have the same target, which means that their goals are either to reach the initiation set of the same target skill, or to reach the same final goal. This means that we can consider merging the final segment of each trajectory, or two segments whose successor segments have been merged. Thus, two chains are merged by starting at their final skill segments. Each pair of segments is merged if they are a good statistical match. This process is repeated until a pair of skill segments fails to merge, after which the remaining skill chains branch off on their own. This process is depicted in Figure 4. A similar process can be used to merge a chain into an existing tree by following the chain with the highest merge likelihood when a branch in the tree is reached. For more details, see Konidaris et al. (2010).


Figure 4. Merging two skill chains into a skill tree. First, the final trajectory segment in each of the chains is considered (a). If these segments use the same model, overlap, and can be well represented using the same function approximator, they are merged, and the second segment in each chain can be considered (b). This process continues until it encounters a pair of segments that should not be merged (c). Merging then halts and the remaining skill chains form separate branches of the tree (d).
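The merging procedure can be summarized schematically as follows; the statistical test for whether two segments match (same abstraction, overlapping, well represented by a single function approximator) is abstracted into a predicate, and the code is an illustrative restatement rather than the authors' implementation.

```python
def merge_chains(chain_a, chain_b, same_skill):
    """Merge two skill chains into a tree, working backwards from their final segments.

    chain_a, chain_b: lists of segments ordered from first to last skill
    same_skill(seg1, seg2): True if the pair passes the statistical merge test
    Returns (shared, branch_a, branch_b): the merged suffix plus the two branches.
    """
    shared = []
    a, b = list(chain_a), list(chain_b)
    while a and b and same_skill(a[-1], b[-1]):
        shared.append((a.pop(), b.pop()))   # merge this pair into one skill
    shared.reverse()
    return shared, a, b                     # remaining prefixes form separate branches
```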

4. Applications

4.1. The Pinball Domain

The Pinball domain is a difficult continuous domain with dynamic aspects, sharp discontinuities, and extended control characteristics.1 The goal is to maneuver a small ball (which starts in one of two places) into a large red hole. The ball is dynamic, so its state is described by four variables: x, y, ẋ, and ẏ. Collisions with obstacles are fully elastic and cause the ball to bounce, so rather than merely avoiding obstacles the agent may choose to use them to efficiently reach the hole. There are five primitive actions: incrementing or decrementing ẋ or ẏ by a small amount (which incurs a reward of −5 per action), or leaving them unchanged (which incurs a reward of −1 per action). Reaching the goal obtains a reward of 10,000.

1Java source code for Pinball can be downloaded at: http://www-all.cs.umass.edu/~gdk/pinball

Five pairs of demonstration trajectories (one trajectory in each pair for each start state) were collected from a human expert. Trajectory segmentation was successful for all demonstration trajectories, and all pairs were merged successfully into skill trees. Example segmentations are shown in Figure 5, and the resulting initiation sets (learned using logistic regression, where states inside the skill segment are positive examples and all other observed states are negative examples) are shown in Figure 6.
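A minimal sketch of the initiation-set learning step described above, using an off-the-shelf logistic regression (scikit-learn here purely for illustration; the paper does not specify a particular implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_initiation_set(segment_states, other_states):
    """Fit an initiation-set classifier for one skill: states inside the skill segment
    are positive examples, all other observed states are negative examples."""
    X = np.vstack([segment_states, other_states])
    y = np.concatenate([np.ones(len(segment_states)), np.zeros(len(other_states))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # I_o(s): treat states with predicted probability above 0.5 as inside the initiation set.
    return lambda s: clf.predict_proba(np.atleast_2d(s))[0, 1] > 0.5
```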

Figure 5. Demonstration trajectories segmented into skill chains, and the trajectory assignments obtained when the two chains are merged.

Figure 6. The initiation sets for each option in the tree shown in Figure 5.

Figure 7 shows learning curves comparing agents given skill trees extracted by CST, agents that build skill trees from scratch using skill chaining (Konidaris and Barto, 2009), and agents given skills pre-learned using 250 episodes of skill chaining. The agents given CST skill trees are able to learn very good policies within 10 episodes, by which time they even exceed the performance of agents given pre-learned skills. This is likely because agents with pre-learned skills begin with many skills to learn how to sequence, whereas the CST agents identify only the number required to solve the task. For more details see Konidaris et al. (2010).

4.2. Acquiring Mobile Manipulation Skills from Human Demonstration

We next show that CST can scale up by applying it to acquire skill chains from human demonstration data on the uBot-5, a dynamically balancing mobile manipulator. The robot's task in this section is to enter a corridor, approach a door, push the door open, turn right into a new corridor, and finally approach and push on a panel (illustrated in Figure 8). A human operator provided 12 demonstration trajectories.

Figure 7. Learning curves in the Pinball domain, for agents employing skill trees created from demonstration trajectories, agents using incremental skill chaining, and agents starting with pre-learned skills.

Figure 8. The task demonstrated on the uBot-5. Starting at the beginning of a corridor (a), the uBot approaches (b) and pushes open a door (c), turns through the doorway (d), then approaches (e) and pushes a panel (f).

To simplify perception, purple, orange, and yellow colored circles were placed on the door and panel, beginning of the back wall, and middle of the back wall, respectively, as perceptually salient markers. The distances (obtained using onboard stereo vision) between the uBot and each marker were computed at 8Hz and filtered. The uBot was able to engage one of two motor abstractions at a time: either performing end-point position control of its hand, or controlling the speed and angle of its forward motion. Using these features, we constructed six sensorimotor abstractions, one for each pairing of salient object and motor command set. We performed policy regression to fit the segmented policies for replay. Policy replay testing was performed by varying the starting point of the robot by hand and used hand-coded stopping conditions that corresponded to the initiation set of the subsequent skill.

Of the 12 demonstration trajectories gathered from the uBot, 3 had to be discarded because of data loss due to excess perceptual noise. Of the remaining 9, all segmented sensibly and 8 were able to be merged into a single skill chain. Figure 9 shows a segmented trajectory obtained using CST, with Table 1 providing a brief description of the skills extracted along with their selected abstractions, and the number of sample trajectories required for each skill to be replayed successfully at least 9 times out of 10.

Figure 9. A demonstration trajectory of the uBot-5 performing the task in Figure 8, segmented into skills.

#   Abstraction    Description           Examples Required
a   torso-purple   Drive to door.        2
b   hand-purple    Open the door.        1
c   torso-orange   Drive toward wall.    1
d   torso-yellow   Turn toward panel.    2
e   torso-purple   Drive to the panel.   1
f   hand-purple    Press the panel.      3

Table 1. A brief description of each of the skills extracted from the trajectory shown in Figure 9, along with their selected abstractions, and the number of example trajectories required for accurate replay.

CST is thus able to successfully segment trajectories demonstrated on a mobile manipulator and assign the relevant abstractions to each individual skill. Since each skill is defined using only a small number of relevant task variables, it requires a very low number of demonstration trajectories to achieve reliable replay. For details see Konidaris et al. (2010).

A similar experiment used CST in combination with model-based control methods for obtaining the skill policies. Here, the skill goals identified by CST (a small region around the first few states of the successor skill) were used as targets for closed-loop controllers. By segmenting the demonstrated trajectory into subtasks, we were able to achieve reliable replay using just a single demonstration with a simple control algorithm that would have been difficult or impossible to apply monolithically. For details see Kuindersma et al. (2010).

4.3. Acquiring Mobile Manipulation Skills from a Learned Policy

We now describe a robot system that produces its own demonstration trajectories by learning to sequence existing controllers and extracts skills from the resulting learned solution. In the Red Room task, the uBot-5 is placed in a small room containing a button and a handle. When the handle is pulled after the button has been pressed, a door in the side of the room opens, allowing the robot access to a compartment which contains a switch. The goal of the task is to press the switch. A schematic and photographs of the domain are given in Figure 10.

Figure 10. The Red Room Domain.

The robot is given a fixed set of innate controllers for navigating to visible objects of interest and for interacting with them using its end-effector (by extending its arm and moving it to the right, left, up, down, or forwards). In order to actuate the button and the switch, the robot must extend its arm and then move it outwards; in order to actuate the handle, it must extend its arm and then move it downwards. The robot constructed a transition model of the domain through interaction and used it to compute a policy using dynamic programming. The robot was able to find the optimal solution after five episodes.
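The paper does not specify the exact dynamic programming variant used on the robot; as a generic illustration, a tabular policy over a learned transition model could be computed with plain value iteration as follows.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Plain value iteration over a learned tabular model (illustrative only).

    P[a] is an (S, S) transition matrix for controller a; R is an (S, A) reward array.
    Returns the greedy policy (one controller index per state) and the state values.
    """
    S, A = R.shape
    V = np.zeros(S)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(A)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```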

The resulting optimal sequence of controllers was then used to generate 5 demonstration trajectories for use in CST. The resulting trajectories all segmented into the same sequence of 10 skills, and were all merged successfully. An example segmentation is shown in Figure 11; a description of each skill along with its relevant abstraction is given in Table 2.

Figure 11. A trajectory from the learned solution to the Red Room task, segmented into skills.

#   Abstraction      Description
A   torso-button     Align with the button.
B   torso-button     Approach the button.
C   hand-button      Push the button.
D   torso-handle     Align with the handle.
E   torso-handle     Approach the handle.
F   hand-handle      Pull the handle.
G   torso-entrance   Align with the entrance.
H   torso-entrance   Drive through the entrance.
I   torso-switch     Approach the switch.
J   hand-switch      Press the switch.

Table 2. A brief description of each of the skills extracted from the trajectory shown in Figure 11, along with their selected abstractions.

CST consistently extracted skills that corresponded to manipulating objects in the environment, and navigating towards them. In the navigation case, each controller execution was split into two separate skills. These skills correspond exactly to the two phases of the navigation controller: first, aligning the robot with the normal of the target object, and second, moving the robot toward that feature. In the object-manipulation case, a sequence of two controllers is collapsed into a single skill: for example, extending the arm and then extending it further toward the object of interest is collapsed into a single skill which we might label push the button. We constructed closed-loop manipulation policies by fitting splines over relative spatial waypoints and obtained reliable replay using a single demonstration trajectory for each. Once these skills were added to the robot's control system, it was able to deploy them in a second problem to solve it faster than it could using just its innate controllers. For details see Konidaris et al. (2011a).
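As a rough sketch of the spline-over-waypoints idea only: the specific spline type and tracking controller used on the uBot are not detailed in this paper, so the choices below (a cubic spline via SciPy, hand positions expressed relative to the target object) are assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fit_waypoint_spline(times, waypoints):
    """Fit a smooth path through end-effector waypoints expressed relative to the target.

    times:     1D array of timestamps for the demonstrated waypoints
    waypoints: (N, 3) array of relative end-effector positions
    Returns a callable spline; query spline(t) for the desired relative hand position,
    which a closed-loop controller would then track.
    """
    return CubicSpline(np.asarray(times), np.asarray(waypoints), axis=0)

# Example usage (demo_times and demo_waypoints come from a single segmented demonstration):
#   spline = fit_waypoint_spline(demo_times, demo_waypoints)
#   desired = spline(np.linspace(demo_times[0], demo_times[-1], 50))
```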

5. Related Work

A great deal of work exists under the general heading of LfD (surveyed by Argall et al. (2009)). Most methods learn an entire policy monolithically from data, although some perform segmentation to extract skills.

The approach most closely related to CST is by Dixon and Khosla (2004a), where a demonstration trajectory is segmented into a sequence of linear dynamical systems. Each linear dynamical system is used to derive a convergent controller, with a small region around the final state considered its goal. The algorithm can be run online and was used in conjunction with several other methods to build a mobile robot system that performed LfD by tracking a human user (Dixon and Khosla, 2004b). This method differs from CST in three ways: it does not use skill-specific abstractions; it segments demonstration trajectories into policies that are linear in the robot's state variables, which is a stronger condition than a value function that is linear in a set of basis functions; and finally, it uses a heuristic method (an error metric exceeding a threshold parameter) for segmentation.

Another closely related LfD framework is the Performance-Derived Behavior Vocabularies (PDBV) framework (Jenkins and Mataric, 2004), which segments demonstrated data into motion primitives and clusters them to build a motion primitive library. CST differs from PDBV in four major aspects. First, PDBV is a batch method. Second, PDBV discovers a low-dimensional representation (a motion-manifold) for each primitive, rather than using an abstraction library. Third, PDBV extracts motion policies directly, which are not amenable to automatic improvement because they do not have goals. Finally, PDBV performs segmentation using Kinematic Centroid Segmentation, a heuristic specific to human-like motions. More recent work has used computationally expensive but principled statistical methods (Grollman and Jenkins, 2010; Butterfield et al., 2010) to segment the data as a way to avoid perceptual aliasing in the policy.

Chiappa et al. (2009) and Chiappa and Peters (2010) described a principled statistical approach to acquiring motion primitive libraries from data, where the demonstrated trajectories are modeled as Bayesian linear Gaussian state space models. This approach automatically segments the demonstrated data into policies, and it can handle multivariate target variables and models that repeat within a single trajectory. Although these systems have achieved impressive results on real robots, they use computationally intensive batch processing, do not use skill-specific abstractions, and do not result in skills with goals.

6. Discussion and Conclusion

The availability of a suitable abstraction library is key to the application of CST in high-dimensional domains. This could potentially require significant (though in principle once-off) design effort to create, program, and debug a set of abstractions suitable for representing anything the robot may decide to learn. However, we expect that a small library consisting of pairings of motor abstractions and one or two visible objects will often be sufficient. If necessary, abstraction selection could be paired with a feature selection method to augment the selected abstraction during policy learning. Future work may also consider approaches to acquiring the abstraction library from data or finding feasible ways to build skill-specific abstractions at runtime.

An important assumption made by CST is that each skill segmentation should form a chain and the merged chains should form a tree. In some cases, however, they may form a more general graph (e.g., when the demonstrated policy has a loop). The procedure to merge skill chains could be generalized to accommodate such cases, and applied to merging skills both within an individual chain and across chains.

CST is agnostic to the method used to learn each individual skill policy from data. The experiments reported here used four different methods: value function regression in Pinball, both regression on motor output and model-based planning for acquiring policies from a human expert on the uBot, and a spline-based closed-loop controller for acquiring skills from learned controller sequences on the uBot. For robust and reliable robot control from demonstration data, we expect that either a stabilized, trajectory-following controller or a dynamic movement primitive (Schaal, 2003) controller will provide a good balance between learnability and efficiency.

CST aims to extract transferable policy components that the robot can retain, refine, and reuse. This is accomplished through a principled approach to learning from multiple unsegmented trajectories, and the use of skill-specific abstractions which enable efficient representation and facilitate transfer. These advantages offer a promising avenue of development toward general-purpose learning from demonstration.

Acknowledgments

We would like to thank the members of the LPR for their technical assistance. Andrew Barto and George Konidaris were supported in part by the AFOSR under grant FA9550-08-1-0418. George Konidaris was also supported in part by the AFOSR under grant AOARD-104135 and the Singapore Ministry of Education under a grant to the Singapore-MIT International Design Center. Scott Kuindersma is supported by a NASA GSRP fellowship from Johnson Space Center. Rod Grupen was supported by the Office of Naval Research under MURI award N00014-07-1-0749.

References

B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483, 2009.

R.R. Burridge, A.A. Rizzi, and D.E. Koditschek. Sequential composition of dynamically dextrous robot behaviors. International Journal of Robotics Research, 18(6):534–555, 1999.

J. Butterfield, S. Osentoski, G. Jay, and O.C. Jenkins. Learning from demonstration using a multi-valued function regressor for time-series data. In Proceedings of the Tenth IEEE-RAS International Conference on Humanoid Robots, 2010.

S. Chiappa and J. Peters. Movement extraction by detecting dynamics switches and repetitions. In Advances in Neural Information Processing Systems 23, pages 388–396, 2010.

S. Chiappa, J. Kober, and J. Peters. Using Bayesian dynamical systems for motion template libraries. In Advances in Neural Information Processing Systems 21, pages 297–304, 2009.

K.R. Dixon and P.K. Khosla. Trajectory representation using sequenced linear dynamical systems. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3925–3930, 2004a.

K.R. Dixon and P.K. Khosla. Learning by observation with mobile robots: a computational approach. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 102–107, 2004b.

P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society B, 69:589–605, 2007.

D.H. Grollman and O.C. Jenkins. Incremental learning of subtasks from unsegmented demonstration. In International Conference on Intelligent Robots and Systems, 2010.

O.C. Jenkins and M. Mataric. Performance-derived behavior vocabularies: data-driven acquisition of skills from motion. International Journal of Humanoid Robotics, 1(2):237–288, 2004.

G.D. Konidaris and A.G. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pages 1015–1023, 2009.

G.D. Konidaris, S.R. Kuindersma, A.G. Barto, and R.A. Grupen. Constructing skill trees for reinforcement learning agents from demonstration trajectories. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1162–1170, 2010.

G.D. Konidaris, S.R. Kuindersma, R.A. Grupen, and A.G. Barto. Autonomous skill acquisition on a mobile manipulator. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011a.

G.D. Konidaris, S. Osentoski, and P.S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011b.

S.R. Kuindersma, G.D. Konidaris, R.A. Grupen, and A.G. Barto. Learning from a single demonstration: Motion planning with skill segmentation. In Proceedings of the NIPS Workshop on Learning and Planning from Batch Time Series Data, December 2010.

L.-J. Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 781–786, 1991.

T. Lozano-Perez, M.T. Mason, and R.H. Taylor. Automatic synthesis of fine-motion strategies for robots. The International Journal of Robotics Research, 3(1):3–24, 1984.

F.G. Martin. The Handy Board Technical Reference. MIT Media Lab, Cambridge MA, 1998.

S. Schaal. Dynamic movement primitives - a framework for motor control in humans and humanoid robots. In Proceedings of the International Symposium on Adaptive Motion of Animals and Machines, 2003.

R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.

R. Tedrake. LQR-Trees: Feedback motion planning on sparse randomized trees. In Proceedings of Robotics: Science and Systems, pages 18–24, 2009.