
Hierarchical Hybrid-Reality Simulator for Practical Reinforcement Learning

PI: Christopher G. Atkeson ([email protected]), Co-PI: Akihiko Yamaguchi

1 Abstract

In order to develop a practical reinforcement learning (RL) method for robots, we propose a hierarchical hybrid-reality simulator and reasoning methods built on it. In this model-based RL approach, we use several important ideas from related fields. One is a skill library with which complicated tasks and dynamics are decomposed into sub-skills and component dynamical systems. Each dynamical system corresponds to a sub-skill (or primitive action) and estimates the output state from the input. We refer to this idea as (sub)task-level dynamical systems; it dramatically helps model-based RL avoid the simulation-bias issue (the error in time integrals of learned models increases rapidly). Another idea is emphasizing the shareability (or reusability) of component dynamic models for efficiency in learning many different tasks. If analytical models, such as geometry models, are available, we use them; we use learned models only when necessary. The entire concept of the hierarchical hybrid-reality simulator is illustrated in Figure 1. As a consequence, the proposed RL method would be a practical robotic tool. We explore it with challenging robot tasks, such as pouring liquids and powders, cutting food, and cooking.

2 Introduction

Reinforcement learning (RL) for practical domains is challenging and important. Robotics technologies succeed in rigid body control and rigid object manipulation, while manipulation of deformable objects is still a difficult problem. Manipulation of non-rigid objects has wide applications in home care and hazardous environments. RL is useful in these scenarios since they involve dynamical systems for which we do not have accurate analytical models. Currently, RL methods are far from practical applications. The following challenges must be addressed to break through this boundary.

Efficiency of Learning Behaviors Across Many Tasks: Unlike much other reinforcement learning and robot learning research, we consider an RL method that learns behaviors for many different tasks. This is more practical since, for example, home-care robots are engaged in multiple tasks. It may require theoretical extensions of RL since we need to consider the efficiency of RL in learning multiple tasks. Many RL applications in robotics optimize a policy only in a specific situation (e.g. [25, 24]). Such research has focused on the adaptation of robots, i.e. improving a policy for an unseen situation. For covering many tasks, the generalization of each learned policy is also important to improve learning efficiency. This generalization includes the shareability of learned components, such as primitive policies and local dynamic models.

Learning Behaviors Involving Sub-skills: We consider tasks consisting of sub-skills (primitives). For example, a pouring task involves grasping a container, opening it, moving it to the location of another container, and pouring the material. Many practical tasks are of this type, but they are difficult for RL since such tasks involve physical mode changes. For example, robots can move an object only while grasping it; in pouring, the material does not come out of the container if its lid is closed.

Learning Deformable Object Manipulation: Deformable object manipulation, such as cutting, pouring, and cooking, is a challenging research field in robotics. Solving such problems is a good domain for RL. These tasks involve different skills, some of which are common across tasks. Thus they are examples that include the above challenges.

Learning Concepts: A goal of cooking is to make tasty food. While a robot can measure the salinity, brix, and acidity of food, only humans rate the taste. If robots learn the relation between the taste rating and sensing data (salinity, brix, acidity, how the food looks, the amount of seasoning used, etc.), which is a kind of dynamical model, they can reason about cooking strategies to make tasty food. This is an RL problem with a high-level goal (taste ratings as rewards). The capability to deal with high-level goals increases the usefulness of RL.



Figure 1: Conceptual illustration of the hierarchical hybrid-reality simulator. It consists of a skill library (symbolic representations of tasks) and a dynamical-model server. Each dynamical model may be an analytical model or a learned model such as a neural network.

Overview of Approach: Our central idea for the above challenges is creating a hierarchical hybrid-reality simulator. We also develop planning methods over these dynamical models; thus our approach is model-based RL. Our simulator is a large simulator in which potentially every dynamical system in the world is modeled. It includes not only rigid body dynamics but also deformable object dynamics, such as liquids, powders, and vegetables. It includes relations between sensing data and human concepts, for example taste and food properties (e.g. salinity, brix, and acidity). It includes failure models as probabilistic bifurcations. It consists of many component dynamical systems. It is a hybrid-reality simulator in which we consider two types of component models: analytical and learned. When we do not have good models, which is common in deformable object manipulation, we learn models from practice.

From a technical point of view, making such a simulator precise and accurate is difficult. We propose to model (sub)task-level dynamic systems that are tightly connected with skills (sub-tasks, or primitive actions). They form graph structures of dynamical systems. For example, in pouring, we first decompose the entire task into sub-skills. Then we train dynamical models through practice that learn the mapping from input state and action parameters to output state. Our simulator has a hierarchical structure; we refer to it as a hierarchical hybrid-reality simulator.

Why Our Approach Works for the Above Challenges: Although there is much transfer learning work (e.g. [75, 35, 66, 18]), reusing policies among tasks is typically more difficult than reusing dynamical models. This is because policies are the result of planning entire tasks, while dynamical models represent local phenomena. Thus our model-based RL approach contributes to the efficiency of learning behaviors across different tasks.

We consider a skill library to introduce the hierarchical structure into the simulator. It can naturally represent tasks with sub-skills. As an RL method, we provide a way to handle hard-to-model dynamical systems, including deformable object manipulation. Similarly, our system can learn relationships between concepts and sensing data.

Preliminary Work: This video of a PR2 robot pouring is a good introduction to our work [71]: https://youtu.be/GjwfbOur3CQ . From this case study of pouring, we obtained key ideas for practical robot learning: a skill library is useful to deal with the variations of tasks, and planning motions with geometry models is a successful approach in robotics. We investigated the framework of learning decomposed (subtask-level) dynamic systems and planning actions in [69]. We investigated a stochastic extension of neural networks in [70] that is useful for modeling dynamical systems. With these results, this research proposal is feasible for the PI.



3 Intellectual Merit

• A practical reinforcement learning method is developed that is capable of learning complicated robot tasks such as deformable object manipulation.

• As model-based RL, our research is superior to the state of the art (e.g. [32]) because of the (sub)task-level dynamic models.

• Compared to direct policy search, which is among the most successful robot learning approaches (cf. [23]), our approach has better generalization ability, reusability and shareability, and robustness to reward changes. We contribute to reducing the simulation-bias issue [23].

• As an approach of planning with physics simulators, our method provides non-rigid object models that are hard to handle even with state-of-the-art simulators such as MuJoCo [65].

• As deep reinforcement learning research, we present a practical use of deep neural networks as learned models in a part of the simulator.

• We also contribute to robotics as research on deformable object manipulation.

Further discussion appears in Section 6.

4 Technical Ideas

Our central idea is creating a hierarchical hybrid-reality simulator. We aim to model every dynamical system in the world, which includes relationships between robot actions and physical or mental phenomena. We also develop planning methods over these dynamical models in order to reason about robot policies.

The most difficult issues in realizing such a simulator are accuracy and computation time. Since our goal is creating dynamical models for robots to reason about actions, we do not intend to create perfectly accurate and precise models, or models for realistic computer graphics. Instead, we create abstract or bypass models rather than detailed dynamical models. For example, in a pouring behavior, we assume a structure: grasping a container, opening the lid, moving it to a receiver position, and making flow. Another remark is that an accurate and precise model of flow would not be necessary; humans may not have a model to estimate the movement of each droplet. An abstracted model, such as the center of flow, is enough for reasoning. Thus we build our simulator with such abstracted or bypassed models, which we refer to as (sub)task-level dynamic models. We consider primitive actions (sub-skills or sub-tasks) and create dynamical models that estimate the outcome states from input states and action parameters. This idea is based on our previous work on task-level robot learning [2, 3]. The detailed ideas are described below.

Building a Skill Library: A library of skills is tightly connected with the simulator since the (sub)skills give hierarchical structures and decompositions of the dynamical systems. We consider a graph structure to represent a skill. A node corresponds to a state, and an edge corresponds to a primitive action or another skill. We use bifurcations to represent a selection among different actions (e.g. {tipping, shaking, squeezing}) and classes of outcomes (e.g. {grasped, not grasped}).
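
The following minimal Python sketch illustrates one way such a skill graph could be represented, with nodes as symbolic states and edges as primitive actions or sub-skills; the class and action names are hypothetical placeholders, not the proposal's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Edge:
    """A primitive action or sub-skill leading to a successor state node."""
    action: str                  # e.g. "grasp", "move_to_receiver", "tip"
    target: str                  # name of the successor node
    probability: float = 1.0     # used at outcome bifurcations (e.g. grasped / not grasped)

@dataclass
class Node:
    """A symbolic state in the skill graph."""
    name: str
    edges: List[Edge] = field(default_factory=list)

class SkillGraph:
    """A skill as a graph: nodes are states, edges are primitive actions or other skills."""
    def __init__(self):
        self.nodes: Dict[str, Node] = {}

    def add_edge(self, src: str, action: str, dst: str, probability: float = 1.0):
        for name in (src, dst):
            self.nodes.setdefault(name, Node(name))
        self.nodes[src].edges.append(Edge(action, dst, probability))

# Toy pouring skill: an outcome bifurcation (grasped / not grasped) followed by a
# selection bifurcation over flow-making actions (tip / shake / squeeze).
g = SkillGraph()
g.add_edge("start", "grasp", "grasped", probability=0.9)
g.add_edge("start", "grasp", "not_grasped", probability=0.1)
g.add_edge("grasped", "move_to_receiver", "at_receiver")
for flow_action in ("tip", "shake", "squeeze"):
    g.add_edge("at_receiver", flow_action, "flow_observed")
```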

An easy way to construct a skill library is to implement skills manually from human demonstrations or knowledge. A more sophisticated way is to use a machine-learning tool to segment demonstrated human motions into finite state representations (e.g. [44]).

Learning Task-level Dynamic Systems: We build another library of dynamical models of (sub)skills. A basic strategy is to use a regression model that estimates an output state from an input state and action parameters for each skill, which we refer to as a component dynamical model. We use a classifier to model a bifurcation. Moreover, we also consider passive dynamical models. For example, in pouring, the process of flow going into a receiver is passive dynamics since the robot can do nothing. Although we can model active dynamics alone by bypassing the passive dynamics, modeling passive dynamics sometimes provides more accurate estimates [69].
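
As an illustration of this interface, the sketch below fits a regressor from (input state, action parameters) to output state, plus a classifier for an outcome bifurcation, using scikit-learn on placeholder data; the proposal itself uses stochastic models such as locally weighted regression or neural networks, so this is only a hedged example of the idea.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Placeholder pouring data: input = [amount in source, tilt angle, duration],
# output = [amount poured, amount spilled]; real data would come from robot practice.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
Y = rng.random((200, 2))
spilled = (Y[:, 1] > 0.5).astype(int)   # placeholder bifurcation labels

# Component dynamical model: (input state, action parameters) -> output state.
dyn_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, Y)

# Bifurcation model: classifier over outcome classes of the sub-skill.
bifurcation = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, spilled)

x_query = np.array([[0.8, 0.6, 0.3]])
print(dyn_model.predict(x_query))           # predicted output state
print(bifurcation.predict_proba(x_query))   # predicted outcome probabilities
```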

Such task-level dynamic models are useful to deal with the simulation-bias issue of the model-based RL approach (cf. [23]). In reasoning about a policy, we temporally integrate the dynamical models to estimate future states and rewards. When using learned models, the estimation error increases rapidly during integration. With (sub)task-level dynamic models, the number of integration steps is dramatically reduced, which contributes to error reduction.

Other benefits of learning (sub)task-level dynamic models are: (A) Generalization: in many cases, a model-based approach generalizes well (e.g. [36]). (B) Reusability: learned models can be commonly used in different tasks. (C) Robustness to reward changes: even when we modify the reward function, we can plan a new policy with the learned models without new physical practice.

Stochastic Modeling: Even task-level dynamical models may have modeling errors, and there will be sensing noise. In order to estimate the effects of simulation biases, we use a stochastic model for learning dynamical systems, such as locally weighted regression [7]. With a stochastic model, we estimate probability propagation along the component dynamical models on a graph structure. This enables a planning method to take the simulation biases into account. Specifically, we use the expectation of rewards, computed from the probability distributions, as the objective function.
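
A minimal sketch of this idea, using Monte-Carlo sampling in place of analytic propagation (a simplification made here for brevity): an initial state distribution is pushed through a chain of stochastic component models and the expected reward serves as the planning objective. The model functions and noise levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def grasp_model(x):
    """Hypothetical stochastic component model: nominal mapping plus prediction noise."""
    return x + rng.normal(0.0, 0.02, size=x.shape)

def pour_model(x, tilt):
    """Hypothetical pouring model: a tilt-dependent fraction is transferred, with noise."""
    poured = np.clip(tilt * x, 0.0, x)
    return poured + rng.normal(0.0, 0.05, size=x.shape)

def expected_reward(tilt, target=0.3, n_samples=1000):
    """Propagate the state distribution through the chain of component models by sampling."""
    x0 = rng.normal(0.5, 0.05, size=n_samples)   # initial amount in the source container
    x1 = grasp_model(x0)                         # after the grasping sub-skill
    poured = pour_model(x1, tilt)                # after the pouring sub-skill
    return np.mean(-(poured - target) ** 2)      # expectation of the reward (objective)

# Planning then chooses the continuous action parameter that maximizes the expectation.
tilts = np.linspace(0.1, 1.0, 10)
print(max(tilts, key=expected_reward))
```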

Deep Neural Networks for Learning Models: Among many regression and classification methods, we consider using neural networks for modeling dynamical systems and bifurcations. Deep learning has succeeded in a wide range of areas such as image classification [26], object detection [19], pose estimation [67], and playing the game of Go [55]. One reason is that it can automatically extract important features from data, so the problem of finding good features is relaxed. This is useful in our framework. Since we use stochastic modeling as mentioned above, we need to extend neural networks to be capable of stochastic calculations.
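
One common construction for such a stochastic extension (shown here as an assumption, not necessarily the exact extension this proposal develops) is a network that outputs a mean and a log-variance per output dimension and is trained with a Gaussian negative log-likelihood, as in the PyTorch sketch below on placeholder data.

```python
import torch
import torch.nn as nn

class StochasticDynModel(nn.Module):
    """Predicts a diagonal Gaussian over the output state (mean and log-variance)."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, out_dim)
        self.logvar_head = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    """Negative log-likelihood of the target under the predicted diagonal Gaussian."""
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

model = StochasticDynModel(in_dim=3, out_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(256, 3)   # placeholder (input state, action parameters)
y = torch.rand(256, 2)   # placeholder output states
for _ in range(200):
    mean, logvar = model(x)
    loss = gaussian_nll(mean, logvar, y)
    opt.zero_grad(); loss.backward(); opt.step()
```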

Stochastic Differential Dynamic Programming to Plan Continuous Parameters: We use stochastic differential dynamic programming (DDP; [38]) to plan continuous parameters of actions. The most basic structure of the hierarchical hybrid-reality simulator is a linear structure. DDP is applicable to planning parameters of linear-structured dynamical systems. We use a stochastic version of DDP (e.g. [46]) in accordance with the stochastic modeling.

Extension of DDP for Graph-structured Dynamic Systems: For general graph structures of the hierarchical hybrid-reality simulator, we extend DDP based on graph theory and on algorithms for the N-armed bandit problem [62]. A graph structure is transformed into a tree structure. We extend DDP for tree structures to plan continuous parameters, and use an algorithm for the N-armed bandit problem for planning selection (discrete) parameters.
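
The discrete side of this planner can be sketched with a standard UCB1 bandit rule over candidate sub-skill branches, where each evaluation of a branch stands in for optimizing that branch's continuous parameters (e.g. with stochastic DDP) and returning the resulting expected reward; the branch values below are hypothetical placeholders.

```python
import math, random

def ucb1_select(counts, values, t, c=1.4):
    """UCB1 rule over discrete skill selections (e.g. {tip, shake, squeeze})."""
    for i, n in enumerate(counts):
        if n == 0:
            return i                   # try every branch at least once
    return max(range(len(counts)),
               key=lambda i: values[i] + c * math.sqrt(math.log(t) / counts[i]))

def evaluate_branch(i):
    """Stand-in for optimizing the branch's continuous parameters and returning the
    resulting expected reward; here a noisy placeholder value."""
    return [0.3, 0.6, 0.5][i] + random.gauss(0.0, 0.1)

branches = ["tip", "shake", "squeeze"]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 201):
    i = ucb1_select(counts, values, t)
    r = evaluate_branch(i)
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]   # running mean of the branch's return
print(branches[max(range(3), key=lambda i: values[i])])   # expected: "shake"
```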

Multimodal Dynamic Systems: A major advantage of the hybrid-reality simulator is that we can deal with any type of sensing value as long as the values are consistent. It is suited for multimodal dynamic systems. We are considering a range of sensors including vision, tactile, sound, salinity, brix, and acidity.

Hybrid of Model-free and Model-based RL: Although we address the simulation-bias issue of model-based RL as described above, model-free RL still has advantages in final performance (i.e. fine-tuning is better) and computational cost at execution time (cf. [23]). We consider introducing a hybrid of model-free and model-based approaches (e.g. [59, 61]). This will increase the practicality of our method.

Learning from Human Demonstrations: We consider learning from human demonstrations (LfD) at the hierarchical (symbolic) level as well as the control command level. The structure of the hierarchical hybrid-reality simulator is built from a knowledge base created by humans. Automating this process increases the usability of our method. For this purpose, we can use segmentation methods developed in LfD research (e.g. [44]). Furthermore, we propose to use our simulator in LfD. For example, dynamical models would be useful to estimate the intention of demonstrated tasks.

5 Proposed Research

The central questions of this research are: How do we model every dynamical system in the world, including deformable objects and human concepts? How do we reason about robot behaviors using the hierarchical hybrid-reality simulator? In this context, we propose the following research.



5.1 Proposed Example Tasks

We assume home-care robots supporting humans in household activities. At the beginning stage, we focus on fundamental tasks, especially deformable object manipulation. Examples are: (1) Grasping objects such as rigid and soft containers, food, and clothes. (2) Pouring liquids, powders, and particles. (3) Cutting food such as vegetables, fruits, and meats. (4) Mixing materials such as pancake mix and milk, or seasoning and soup. These are still open problems in robotics since they involve the challenges mentioned in the introduction. Concretely, these tasks have the following features: (A) Each of them consists of sub-skills. (B) There are common sub-tasks, such as grasping and moving an object. (C) They include deformable object manipulation. We show the generalization and adaptation abilities, and the learning efficiency, of our method in learning these tasks. If we can show the usefulness of our method in these tasks, we would be able to say our reinforcement learning is practical.

We also explore cooking simple foods by combining the above skills. There is research on robot cooking, such as making pancakes [31] and baking cookies [12]. Although they report successful results, their robotic behaviors do not generalize widely. We explore how our approach increases generalization ability. Another interesting point of cooking tasks is that they have dynamics between middle stages of cooking, where the robot obtains salinity sensing and so on, and the rating of the food (e.g. taste rating by humans).

5.2 Proposed Work: DDP for Graph-Structured Dynamical Systems

We develop a planning algorithm that is capable of reasoning about discrete selections and continuous parameters of the skills in the hierarchical hybrid-reality simulator. We propose to extend differential dynamic programming (DDP; [38]) to be applicable to our dynamical systems. The existing DDP algorithms are applicable to linear-structured dynamical systems.

We start from an existing stochastic DDP algorithm for linear-structured dynamical systems (e.g. [46]). A challenge is that there will be many local maxima since we use learned dynamical models. Such local maxima would trap DDP since it is a gradient method; we will work on developing a method to avoid this. Then we extend DDP to graph-structured dynamical systems. The idea is that since DDP consists of forward and backward propagating calculations, it is clearly possible to extend it to tree structures. We use graph theory to transform a graph structure into a tree structure, and apply the extended DDP. In order to deal with discrete selections, we refer to methods for the N-armed bandit problem [62].

These algorithms will be verified in the proposed tasks. We explore whether the robot can handle different types of dynamics where different strategies are necessary. For example, in pouring water and ketchup, tipping and shaking skills would be used, respectively.

5.3 Proposed Work: Modeling Hierarchical Dynamic Systems

The core method for making a hierarchical hybrid-reality simulator is proposed. We explore a modeling method for complicated tasks such as pouring and cutting food. These tasks consist of different sub-skills. We create a skill library, and dynamical models of the sub-skills. Together they form a task-level (hierarchical) dynamic system.

If we do not have analytical models of sub-tasks, we learn models through execution. Specifically, we propose to use neural networks. As mentioned above, we use stochastic models of dynamical systems for dealing with simulation biases. We extend neural networks to be capable of: (1) modeling prediction error and output noise, (2) computing an output probability distribution for a given input distribution, and (3) computing gradients of the output expectation with respect to an input. Since neural networks have nonlinear activation functions (e.g. rectified linear units; ReLU), these extensions are not trivial. We solve these issues analytically.
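
As one concrete instance of such an analytic treatment (an illustrative example rather than the full extension), the expectation of a ReLU unit under a Gaussian input has the closed form E[max(0, x)] = mu * Phi(mu/sigma) + sigma * phi(mu/sigma), which the sketch below checks against sampling.

```python
import numpy as np
from scipy.stats import norm

def relu_mean(mu, sigma):
    """E[max(0, x)] for x ~ N(mu, sigma^2): mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    z = mu / sigma
    return mu * norm.cdf(z) + sigma * norm.pdf(z)

# Cross-check the closed form against Monte-Carlo sampling.
mu, sigma = 0.2, 0.5
samples = np.maximum(0.0, np.random.default_rng(0).normal(mu, sigma, 1_000_000))
print(relu_mean(mu, sigma), samples.mean())   # the two values should agree closely
```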

We verify the idea in the grasping, pouring, and cutting tasks. As advanced research, we explore sharing learned models among multiple tasks and robots. For example, grasping skills are used in different tasks. Although a grasping policy is task or situation dependent (e.g. grasping for pouring and grasping for loading dishes into a dishwasher are different), a grasping dynamical model would be easier to share among tasks. This is because policies are the result of planning over a task, while dynamical models are local relationships. Among robots, there would also be shareable dynamical models. For example, in pouring, after the material comes out of the source container and flow happens, the dynamical model between the flow features (flow position, flow variance) and the amount poured into the receiver or spilled onto the table does not depend on the robot.

The task-level dynamic models described so far form a hierarchy of two layers (primitive actions/sub-skills and a higher skill). For more complicated tasks such as cooking, we consider a hierarchy with more layers. Such research is found in hierarchical RL (e.g. [60, 9, 16]). Most of these algorithms are for symbolic worlds, i.e. they are not capable of performing practical robotic tasks. In our work, we explore continuous variables even at higher levels. For example, a pouring skill will have continuous parameters like a target amount, and these are reasoned about in a cooking task.

The proposed simulator decomposes dynamics along primitive actions or skills. If we change the definition of primitive actions or skills, the corresponding dynamical models may become inaccurate. In order to address this issue, we store samples of all past executions for refining the dynamical models. Reusing such samples is not straightforward since changing an action will cause different results. We refer to research on transfer learning [35, 66]; in particular, importance sampling (e.g. [68]) is a useful method in this context. Although our dynamical models are at the sub-task level, we store samples collected during execution of each sub-task. These samples may be useful for transferring the models.

5.4 Proposed Work: Modeling Multimodal Perception

We explore dynamical models with multimodal perception, for example 3D vision and tactile sensing of objects for a grasping task, and vision and sound of flow for a pouring task. Since we use learning methods for modeling dynamical systems, using this information as input or output variables is a natural extension. Furthermore, we explore chemical sensors such as salinity, brix, and acidity. Robot actions such as pouring salt affect these sensing values, so there will be consistent relations. There will also be a certain relationship between these sensing values and taste ratings by humans. Our dynamical models can represent different types of human ratings, such as a score on a scale of 1 to 10 and labels like good/bad or salty/sweet, with a graphical structure including numerical dynamic models. Thus our approach is capable of modeling such human concepts. Using these models, robots will reason about actions to cook tasty food. This would be an interesting subtopic of the hierarchical hybrid-reality simulator.

5.5 Proposed Work: Hybrid of Model-free and Model-based RL

The purpose here is reducing the disadvantages of model-based RL by introducing a model-free approach. We aim to increase the final performance and reduce the computational cost at execution time. Specifically, we refer to direct policy search (e.g. [64, 24]). In addition to the dynamical models, we maintain policies that map input states to actions. There are at least two choices for how to use the samples in learning: using samples to train the dynamical models only (the policies are then optimized with the dynamical models), or using samples to train both the dynamical models and the policies. A well-known architecture is Dyna [58]. Originally it was developed for discrete state-action domains, and later a version with a linear function approximator was developed for continuous domains [61]. Recently, Levine et al. [34] proposed a more practical approach where a trajectory optimization method for unknown dynamics is combined with a local linear model learned from samples. While these methods were developed for continuous or discrete state-action domains, methods for hierarchical dynamic systems have not been researched. Developing a combined method of model-free and model-based approaches for the hierarchical hybrid-reality simulator will contribute to this research field.
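
To make the Dyna idea [58] concrete, the sketch below runs tabular Dyna-Q on a toy five-state chain: each real transition updates the value function and a learned model, and is followed by several imagined updates generated from that model. This discrete toy example is only illustrative; the proposed work targets hierarchical, continuous dynamic systems.

```python
import random
from collections import defaultdict

# Toy deterministic chain MDP: states 0..4, actions {-1, +1}, reward 1 on reaching state 4.
def env_step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

Q = defaultdict(float)                 # action-value estimates Q[(state, action)]
model = {}                             # learned deterministic model: (s, a) -> (s', r)
alpha, gamma, eps, n_planning = 0.5, 0.95, 0.1, 20

def q_update(s, a, r, s2):
    best_next = max(Q[(s2, b)] for b in (-1, 1))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

for episode in range(50):
    s = 0
    while s != 4:
        greedy = max((-1, 1), key=lambda b: Q[(s, b)])
        a = random.choice((-1, 1)) if random.random() < eps else greedy
        s2, r = env_step(s, a)         # real experience
        q_update(s, a, r, s2)
        model[(s, a)] = (s2, r)        # update the learned model
        for _ in range(n_planning):    # planning: imagined updates sampled from the model
            ps, pa = random.choice(list(model.keys()))
            ps2, pr = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s = s2

print(max((-1, 1), key=lambda b: Q[(0, b)]))   # learned first action; expected to be +1
```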

5.6 Proposed Work: Learning Hierarchical Representations From Human Demonstrations

A practical benefit of research on learning from human demonstrations (LfD) is the automation of building the skill library and the (graph) structures of the task-level dynamic systems. Much existing LfD research (e.g. [10, 25, 11]) would provide useful tools for this purpose. In particular, segmentation methods for motions (e.g. [44]) suit task-level modeling.

We contribute to this field by introducing the hierarchical hybrid-reality simulator. Its dynamical models will help robots understand the intention of human demonstrations even if a skill is new to the robots. For example, when a human is demonstrating a squeezing skill to a robot, the robot can guess that the skill is for pouring since the phenomenon matches the high-level dynamic model of pouring. The robot will then search for the new parts on an existing pouring skill structure.

6 Related Work

Although our main focus is model-based reinforcement learning (RL), our research spans a range of research fields including physics simulation, artificial intelligence, optimal control, learning from demonstration, and deformable object manipulation. Each of them has its own boundary. Making a unified theory is necessary for learning practical robot tasks. Our research aims to establish such a unification based on model-based RL work. In the following, we discuss how our approach breaks the boundary of each field.

Reinforcement Learning: Although the currently popular approach to RL in robot learning is model-free (cf. [23], e.g. [25, 24, 72]), especially direct policy search (e.g. [64, 24]), there are many reasons to use model-based RL. In model-based RL, we learn a dynamical model of the system and apply dynamic programming such as DDP (e.g. [53, 42, 32]). The advantages of the model-based approach are: (A) Generalization ability: in some cases, a model-based approach generalizes well (e.g. [36]). (B) Reusability: learned models can be commonly used in different tasks. (C) Robustness to reward changes: even when we modify the reward function, we can plan a new policy with the learned models without new physical practice. A major issue of the model-based approach is so-called simulation bias [23]: the modeling error accumulates rapidly. In our approach, we learn (sub)task-level dynamic models following the idea of task-level robot learning [2, 3]. We learn the input-output relation of a subtask, which often does not require time integration in forward estimation. In addition, we use probabilistic representations for dealing with modeling errors.

Thus, our approach will solve the issues of model-based RL. Furthermore, in the proposed research on the hybrid of model-free and model-based RL, we introduce the advantages of the model-free RL approach into our model-based framework. We think this approach is the most practical form of RL for the challenges mentioned in the introduction.

Planning with Physics Simulators: There are physics engines such as the Open Dynamics Engine (ODE; [56]), Bullet Physics [1], MuJoCo [65], and so on. ODE is widely used in robotics. MuJoCo is used as the dynamics simulator of Gym, a set of benchmark problems for RL developed by OpenAI [15]. Theoretically, it is possible to plan robot behaviors with a dynamics simulator. MuJoCo is designed efficiently for such purposes; e.g. motions of a humanoid robot are optimized by model predictive control in [17]. Each of these simulators uses its own contact model, which extends rigid body dynamics to be closer to real physics. However, their simulation capabilities for non-rigid objects are very limited. Although it is possible to simulate non-rigid body dynamics with rigid objects (e.g. simulating liquids with many spheres [69]), the results are far from real. Our hierarchical hybrid-reality simulator can go beyond them. The core idea is learning (sub)task-level dynamic models with a skill library. Furthermore, it can simulate non-object things (concepts).

Deep Reinforcement Learning: Recently, deep neural networks (DNN) have become popular, and many researchers are investigating their application to reinforcement learning (e.g. [32, 41, 70]). Similar to our approach, DeepMPC uses neural networks to learn models [32]. A notable difference of our approach is that we learn task-level dynamic systems that are robust to the simulation-bias issue and are suited for learning dynamical systems consisting of many sub-skills. Additionally, we introduce a probabilistic computation into neural networks, which also reduces the simulation-bias issue.

Hierarchical Reinforcement Learning: The idea of introducing hierarchy has been considered for decades, under names such as macros, options, chunks, schemes, primitives, basic behaviors, and sub-skills (e.g. [6, 54, 51, 8]). In reinforcement learning, a variety of hierarchical structures have been proposed (cf. [9, 16]). Sutton et al. proposed options, which generalize primitive and macro actions under the RL framework [60]. There is some work on finding options or subgoals automatically [39, 40, 57]. In robotics, Kirchner applied hierarchical Q-learning (HQL) to learn forward movement of a six-legged robot [22].

Some of these methods learn policies (action value functions), while our method learns dynamical models, which have more advantages in sharing knowledge between tasks. Some of them assume discrete actions, while our method deals with continuous robot commands as well as discrete skill selections. Most of them were applied to toy examples, while we apply our method to practical robotic domains including deformable object manipulation. Thus, we contribute to this field by providing practical methods.

Learning from Demonstration: Although our main focus is the hierarchical hybrid-reality simulator, this work could also be considered learning from demonstration (LfD) research [11], since the robot learns new knowledge (skills, dynamical systems) based on a human-made knowledge base. LfD is also known as imitation learning, learning by watching, and programming by demonstration. Much of this work focuses on learning from practice (actual executions) after LfD, which is known as learning from demonstration and practice [10]. A successful method is learning movement primitives from demonstrations and applying reinforcement learning, e.g. the ball-in-cup task [24] and making pancakes [25]. However, these approaches are poor at generalizing. For example, attempts to learn pouring [43, 47, 63, 28, 27, 49, 14] typically focus only on a part of the entire pouring problem. On the other hand, we intend to create a method with which robots can perform tasks over a variety of situations. The key ideas are the skill library and the dynamical models of skills.

Learning higher-level (symbolic) task structures from human demonstrations has been explored in many LfD studies. For example, Kuniyoshi et al. developed a learning-by-watching framework where the robot learns symbolic task descriptions from human demonstrations [30]. More advanced work was done by Jakel et al. [21], where task descriptions are learned as graphs of constraints from demonstrations and actual motions are planned. Ramirez-Amaro et al. proposed a unified method where semantic representations are extracted from human demonstrations with a dynamically growing ontology-based knowledge representation [48]. Their method was verified in complex kitchen activities: making a pancake, making a sandwich, and setting the table. On the other hand, there are LfD methods that learn whole-body motions from demonstrations and extract symbolic-level structures [4, 20, 29]. These studies inspired our work considerably. The major contribution of our research is creating (sub)task-level dynamic models even when they are hard to model analytically.

Artificial Intelligence: The idea of graph-structured dynamical systems is known in the AI field (e.g. [52]). Our contribution to this field is the introduction of numerical approaches (DDP, RL, DNN) into AI methods, and their application to practical robot learning tasks (deformable object manipulation).

Optimal Control: An optimization that maximizes the sum of rewards (or minimizes the sum of costs) over a sequence of actions is referred to as dynamic programming in the RL field [62], while it is also known as optimal control in control theory. Well-known solutions are model predictive control (MPC; e.g. [17, 32]) and differential dynamic programming (DDP; [38]). There is a great deal of work on DDP. Some of it is stochastic (e.g. [46]), similar to ours. Usually, DDP methods use a second-order gradient method (e.g. [46, 33]), while we use a first-order algorithm [74]; our approach is simpler to implement. Previous DDP and MPC methods consider linear-structured dynamical systems (including a single loop structure), while we consider graph-structured dynamical systems. Furthermore, we provide a method to deal with dynamical systems that are hard to model.

Deformable Object Manipulation: Since there are many types of deformable objects, deformable object manipulation research also varies, covering tasks such as folding towels [37], flipping pancakes [25], and cutting vegetables [32]. There is much work on robot pouring [73, 45, 47, 43, 21, 63, 5, 49, 14, 50, 13], including learning to improve pouring skills. These studies focus on particular situations only, i.e. their behaviors do not generalize to different situations, and many of them do not scale up to other types of deformable object manipulation.

7 CMU Resources

The Search-based Planning Lab provides a PR2 robot for pouring experiments (Figure 2). The PR2 robot has two 7-DOF arms, a parallel gripper on each arm, a lift-type torso, and an omni-directional mobile platform. Its arm payload is 1.8 kg, the grip force is 80 N, and the grip range is 0 to 90 mm, which have been sufficient for our pouring experiments so far.

We also have a Baxter research robot for the proposed work (Figure 2). We verified that the same pouring strategy works on the Baxter, as shown in this video: https://youtu.be/NIn-mCZ-h_g . It has two 7-DOF arms and two different types of parallel grippers. Its arm payload is 2.2 kg. One gripper's grip force is 44 N with a grip range of 37 to 75 mm, and the other gripper's grip force is 100 N with a grip range of 0 to 84 mm. The Baxter robot has torque sensors on each joint.



Figure 2: Left: The PR2 robot. This figure shows the setup for pouring experiments, where three external cameras are placed to measure the container locations and the material flow. Right: Our Baxter research robot.

References

[1] Bullet physics library. http://bulletphysics.org/. [Online; accessed Aug-29-2016].
[2] E. W. Aboaf, C. G. Atkeson, and D. J. Reinkensmeyer. Task-level robot learning. In IEEE International Conference on Robotics and Automation, pages 1309–1310, 1988.
[3] E. W. Aboaf, S. M. Drucker, and C. G. Atkeson. Task-level robot learning: juggling a tennis ball more accurately. In the IEEE International Conference on Robotics and Automation, pages 1290–1295, 1989.
[4] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16:1–16:43, 2011.
[5] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea Lockerd Thomaz. Keyframe-based learning from demonstration - method and evaluation. I. J. Social Robotics, 4:343–355, 2012.
[6] R. C. Arkin. Behavior-Based Robotics. MIT Press, Cambridge, MA, 1998.
[7] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.
[8] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Systems, 13:41–77, 2003.
[9] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[10] Darrin C. Bentivegna. Learning from Observation Using Primitives. PhD thesis, Georgia Institute of Technology, 2004.
[11] Aude Billard and Daniel Grollman. Robot learning by demonstration. Scholarpedia, 8(12):3824, 2013.
[12] Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. Interpreting and executing recipes with a cooking robot. In the 13th International Symposium on Experimental Robotics, pages 481–495, 2013.
[13] C. Bowen and R. Alterovitz. Asymptotically optimal motion planning for tasks using learned virtual landmarks. IEEE Robotics and Automation Letters, 1(2):1036–1043, 2016.
[14] Sascha Brandl, Oliver Kroemer, and Jan Peters. Generalizing pouring actions between objects using warped parameters. In the 14th IEEE-RAS International Conference on Humanoid Robots (Humanoids'14), pages 616–621, Madrid, 2014.
[15] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. ArXiv e-prints, (arXiv:1606.01540), 2016.
[16] Shahar Cohen, Oded Maimon, and Evgeni Khmlenitsky. Reinforcement learning with hierarchical decision-making. In ISDA '06: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications, pages 177–182, USA, 2006. IEEE Computer Society.
[17] T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov. An integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 292–299, 2013.
[18] Fernando Fernandez, Javier Garcia, and Manuela Veloso. Probabilistic Policy Reuse for inter-task transfer learning. Robotics and Autonomous Systems, 58(7):866–871, 2010.
[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[20] Tetsunari Inamura, Iwaki Toshima, Hiroaki Tanie, and Yoshihiko Nakamura. Embodied symbol emergence based on mimesis theory. The International Journal of Robotics Research, 4-5(23):363–377, 2004.
[21] R. Jakel, S. R. Schmidt-Rohr, M. Losch, and R. Dillmann. Representation and constrained planning of manipulation strategies in the context of programming by demonstration. In the IEEE International Conference on Robotics and Automation (ICRA'10), pages 162–169, 2010.
[22] Frank Kirchner. Q-learning of complex behaviours on a six-legged walking machine. Robotics and Autonomous Systems, 25(3-4):253–262, 1998.
[23] J. Kober, J. Andrew Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 32(11):1238–1274, 2013.
[24] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203, 2011.
[25] Petar Kormushev, Sylvain Calinon, and Darwin G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'10), pages 3232–3237, 2010.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[27] O. Kroemer, E. Ugur, E. Oztop, and J. Peters. A kernel-based approach to direct action perception. In the IEEE International Conference on Robotics and Automation (ICRA'12), pages 2605–2610, 2012.
[28] K. Kronander and A. Billard. Online learning of varying stiffness through physical human-robot interaction. In the IEEE International Conference on Robotics and Automation (ICRA'12), pages 1842–1849, 2012.
[29] Dana Kulic, Wataru Takano, and Yoshihiko Nakamura. Incremental learning, clustering and hierarchy formation of whole body motion patterns using adaptive hidden Markov chains. The International Journal of Robotics Research, 27(7):761–784, 2008.
[30] Yasuo Kuniyoshi, Masayuki Inaba, and Hirochika Inoue. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10:799–822, 1994.
[31] Lars Kunze and Michael Beetz. Envisioning the qualitative effects of robot manipulation actions using simulation-based projections. Artificial Intelligence, 2015.
[32] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems (RSS'15), 2015.
[33] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems 26, pages 207–215. Curran Associates, Inc., 2013.
[34] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In the IEEE International Conference on Robotics and Automation (ICRA'15), 2015.
[35] Michael G. Madden and Tom Howley. Transfer of experience between reinforcement learning environments with progressive difficulty. Artificial Intelligence Review, 21:375–398, June 2004.
[36] Emarc Magtanong, Akihiko Yamaguchi, Kentaro Takemura, Jun Takamatsu, and Tsukasa Ogasawara. Inverse kinematics solver for android faces with elastic skin. In Latest Advances in Robot Kinematics, pages 181–188, Innsbruck, Austria, 2012.
[37] Jeremy Maitin-Shepard, Marco Cusumano-Towner, Jinna Lei, and Pieter Abbeel. Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In the IEEE International Conference on Robotics and Automation (ICRA'10), pages 2308–2315, 2010.
[38] David Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85–95, 1966.
[39] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In the Eighteenth International Conference on Machine Learning, pages 361–368. Morgan Kaufmann, 2001.
[40] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-Cut - dynamic discovery of sub-goals in reinforcement learning. In ECML '02: Proceedings of the 13th European Conference on Machine Learning, pages 295–306, London, UK, 2002. Springer-Verlag.
[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[42] J. Morimoto, G. Zeglin, and C. G. Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'03), volume 2, pages 1927–1932, 2003.
[43] Manuel Muhlig, Michael Gienger, Sven Hellbach, Jochen J. Steil, and Christian Goerick. Task-level imitation learning using variance-based movement optimization. In the IEEE International Conference on Robotics and Automation (ICRA'09), pages 1177–1184, 2009.
[44] Scott Niekum, Sachin Chitta, Bhaskara Marthi, Sarah Osentoski, and Andrew G. Barto. Incremental semantically grounded learning from demonstration. In Robotics: Science and Systems 2013, 2013.
[45] Yoshiyuki Noda, Ken'ichi Yano, and Kazuhiko Terashima. Control of self-transfer-type automatic pouring robot with cylindrical ladle. IFAC Proceedings Volumes, 38(1):295–300, 2005.
[46] Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems 27, pages 1907–1915. Curran Associates, Inc., 2014.
[47] Peter Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In the IEEE International Conference on Robotics and Automation (ICRA'09), pages 763–768, 2009.
[48] Karinne Ramirez-Amaro, Michael Beetz, and Gordon Cheng. Transferring skills to humanoid robots by extracting semantic representations from observations of human activities. Artificial Intelligence, 2015.
[49] Leonel Rozo, Pablo Jimenez, and Carme Torras. Force-based robot learning of pouring skills using parametric hidden Markov models. In the IEEE-RAS International Workshop on Robot Motion and Control (RoMoCo), 2013.
[50] Leonel Rozo, Joao Silverio, Sylvain Calinon, and Darwin Gordon Caldwell. Learning controllers for reactive and proactive behaviors in human-robot collaboration. Frontiers in Robotics and AI, 3(30), 2016.
[51] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[52] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., 1995.
[53] S. Schaal and C. G. Atkeson. Robot juggling: implementation of memory-based learning. In the IEEE International Conference on Robotics and Automation (ICRA'94), pages 57–71, 1994.
[54] R. A. Schmidt. Motor Learning and Control. Human Kinetics Publishers, Champaign, IL, 1988.
[55] David Silver, Aja Huang, Chris J. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[56] R. Smith. Open Dynamics Engine (ODE). http://www.ode.org/. [Online; accessed Aug-29-2016].
[57] Martin Stolle. Automated discovery of options in reinforcement learning. Master's thesis, McGill University, February 2004.
[58] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann, 1990.
[59] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.
[60] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[61] Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 528–536, 2008.
[62] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[63] Minija Tamosiunaite, Bojan Nemec, Ales Ude, and Florentin Worgotter. Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59(11):910–922, 2011.
[64] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In the IEEE International Conference on Robotics and Automation (ICRA'10), pages 2397–2403, May 2010.
[65] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
[66] L. Torrey and J. Shavlik. Transfer learning. In E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, editors, Handbook of Research on Machine Learning Applications, chapter 11. IGI Global, 2009.
[67] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653–1660, 2014.
[68] Eiji Uchibe and Kenji Doya. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 287–296, 2004.
[69] Akihiko Yamaguchi and Christopher G. Atkeson. Differential dynamic programming with temporally decomposed dynamics. In the 15th IEEE-RAS International Conference on Humanoid Robots (Humanoids'15), 2015.
[70] Akihiko Yamaguchi and Christopher G. Atkeson. Neural networks and differential dynamic programming for reinforcement learning problems. In the IEEE International Conference on Robotics and Automation (ICRA'16), 2016.
[71] Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara. Pouring skills with planning and learning modeled from human demonstrations. International Journal of Humanoid Robotics, 12(3):1550030, 2015.
[72] Akihiko Yamaguchi, Jun Takamatsu, and Tsukasa Ogasawara. DCOB: Action space for reinforcement learning of high DoF robots. Autonomous Robots, 34(4):327–346, 2013.
[73] K. Yano, T. Toda, and K. Terashima. Sloshing suppression control of automatic pouring robot by hybrid shape approach. In Proceedings of the 40th IEEE Conference on Decision and Control, volume 2, pages 1328–1333, 2001.
[74] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. ArXiv e-prints, (arXiv:1212.5701), 2012.
[75] Jianwei Zhang and Bernd Rossler. Self-valuing learning and generalization with application in visually guided grasping of complex objects. Robotics and Autonomous Systems, 47(2-3):117–127, 2004.
