
DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning

XUE BIN PENG and GLEN BERSETH, University of British Columbia
KANGKANG YIN, National University of Singapore
MICHIEL VAN DE PANNE, University of British Columbia

Fig. 1. Locomotion skills learned using hierarchical reinforcement learning. (a) Following a varying-width winding path. (b) Dribbling a soccer ball. (c) Navigating through obstacles.

Learning physics-based locomotion skills is a difficult problem, leading to solutions that typically exploit prior knowledge of various forms. In this paper we aim to learn a variety of environment-aware locomotion skills with a limited amount of prior knowledge. We adopt a two-level hierarchical control framework. First, low-level controllers are learned that operate at a fine timescale and which achieve robust walking gaits that satisfy stepping-target and style objectives. Second, high-level controllers are then learned which plan at the timescale of steps by invoking desired step targets for the low-level controller. The high-level controller makes decisions directly based on high-dimensional inputs, including terrain maps or other suitable representations of the surroundings. Both levels of the control policy are trained using deep reinforcement learning. Results are demonstrated on a simulated 3D biped. Low-level controllers are learned for a variety of motion styles and demonstrate robustness with respect to force-based disturbances, terrain variations, and style interpolation. High-level controllers are demonstrated that are capable of following trails through terrains, dribbling a soccer ball towards a target location, and navigating through static or dynamic obstacles.

Additional Key Words and Phrases: physics-based character animation, motion control, locomotion skills, deep reinforcement learning

ACM Reference format:
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. 2017. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Trans. Graph. 36, 4, Article 41 (July 2017), 16 pages. DOI: http://dx.doi.org/10.1145/3072959.3073602

1 INTRODUCTION
Physics-based simulations of human skills and human movement have long been a promising avenue for character animation, but it has been difficult to develop the needed control strategies. While the learning of robust balanced locomotion is a challenge by itself, further complexities are added when the locomotion needs to be used in support of tasks such as dribbling a soccer ball or navigating among moving obstacles. Hierarchical control is a natural approach


towards solving such problems. A low-level controller (LLC) is desired at a fine timescale, where the goal is predominantly about balance and limb control. At a larger timescale, a high-level controller (HLC) is more suitable for guiding the movement to achieve longer-term goals, such as anticipating the best path through obstacles. In this paper, we leverage the capabilities of deep reinforcement learning (RL) to learn control policies at both timescales. The use of deep RL allows skills to be defined via objective functions, while enabling control policies based on high-dimensional inputs, such as local terrain maps or other abundant sensory information. The use of a hierarchy enables a given low-level controller to be reused in support of multiple high-level tasks. It also enables high-level controllers to be reused with different low-level controllers.
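To make the two-timescale structure concrete, the sketch below shows how such a hierarchy could be wired together at run time: the HLC is queried once per walking step and emits a footstep goal, while the LLC runs many times per step and outputs joint-level actions that pursue that goal. This is an illustrative sketch only, not the authors' implementation; the names env, hlc, llc, terrain_map, and the substep count are hypothetical placeholders.

    # Illustrative sketch (not the authors' code) of the two-timescale
    # control loop described above. `env`, `hlc`, `llc`, `terrain_map`,
    # and `llc_substeps` are hypothetical placeholders.
    def run_episode(env, hlc, llc, max_steps=200, llc_substeps=60):
        """Run one episode with a high-level and a low-level controller."""
        state = env.reset()
        for _ in range(max_steps):
            # High level: map character state + terrain map to a footstep
            # goal, queried once per walking step.
            goal = hlc.act(state.character, state.terrain_map)
            # Low level: track the footstep goal at the fine control
            # timescale, e.g., by producing joint targets every substep.
            for _ in range(llc_substeps):
                action = llc.act(state.character, goal)
                state, reward, done = env.step(action)
                if done:
                    return

The key design point is the interface between the two levels: the HLC never touches joint-level actions, it only chooses goals in the LLC's goal space, which is what lets one LLC serve several high-level tasks.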

Our principal contribution is to demonstrate that environment-aware 3D bipedal locomotion skills can be learned with a limited amount of prior structure being imposed on the control policy. In support of this, we introduce the use of a two-level hierarchy for deep reinforcement learning of locomotion skills, with both levels of the hierarchy using an identical style of actor-critic algorithm. To the best of our knowledge, we demonstrate some of the most capable dynamic 3D walking skills for model-free learning-based methods, i.e., methods that have no direct knowledge of the equations of motion, character kinematics, or even basic abstract features such as the center of mass, and no a priori control-specific feedback structure. Our method comes with its own limitations, which we also discuss.
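For readers less familiar with the actor-critic family referenced above, the fragment below sketches a generic one-step temporal-difference actor-critic update of the kind used at both levels of such a hierarchy. It is a minimal sketch under the assumption of PyTorch-style actor and critic modules; it does not reproduce the paper's exact algorithm, network architectures, or hyperparameters, and terminal-state handling is omitted for brevity.

    import torch

    def actor_critic_losses(actor, critic, batch, gamma=0.99):
        # `batch` holds sampled transitions (state, action, reward, next
        # state); the module and tensor names here are hypothetical.
        s, a, r, s_next = batch
        with torch.no_grad():
            target = r + gamma * critic(s_next)      # one-step TD target
        td_error = target - critic(s)

        critic_loss = td_error.pow(2).mean()         # fit the value function
        # Policy-gradient term weighted by the TD error (an advantage
        # estimate); detach so actor gradients do not flow into the critic.
        actor_loss = -(actor.log_prob(s, a) * td_error.detach()).mean()
        return actor_loss, critic_loss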

2 RELATED WORK
Modeling movement skills, locomotion in particular, has a long history in computer animation, robotics, and biomechanics. It has also recently seen significant interest from the machine learning community as an interesting and challenging domain for reinforcement learning. Here we focus only on the most closely related physics-based work in computer animation and reinforcement learning.


Physics-based Character Control: A recent survey on physics-based character animation and control techniques provides a comprehensive overview of work in this area [Geijtenbeek and Pronost 2012]. An early and enduring approach to controller design has been to structure control policies around finite state machines (FSMs) and feedback rules that use a simplified abstract model or feedback law. These general ideas have been applied to human athletics and running [Hodgins et al. 1995] and a rich variety of walking styles [Coros et al. 2010; Lee et al. 2010; Yin et al. 2007]. Many controllers developed for physics-based animation further use optimization methods to improve controllers developed around an FSM structure, or use an FSM to define phase-dependent objectives for an inverse dynamics optimization to be solved at each time step. Policy search methods, e.g., stochastic local search or CMA [Hansen 2006], can be used to optimize the parameters of the given control structures to achieve a richer variety of motions, e.g., [Coros et al. 2011; Yin et al. 2008], and efficient muscle-driven locomotion [Wang et al. 2009]. Policy search has been successfully applied directly to time-indexed splines and neural networks in order to learn a variety of bicycle stunts [Tan et al. 2014]. An alternative class of approach is given by trajectory optimization methods, which can compute solutions offline, e.g., [Al Borno et al. 2013], that can be adapted for online model-predictive control [Hämäläinen et al. 2015; Tassa et al. 2012], or that can compute optimized actions for the current time-step using quadratic programming, e.g., [de Lasa et al. 2010; Macchietto et al. 2009]. Wu and Popović [2010] proposed a hierarchical framework that incorporates a footstep planner and model-predictive control for bipedal locomotion across irregular terrain. To further improve motion quality and enrich the motion repertoire, data-driven models incorporate motion capture examples in constructing controllers, most often using a learned or model-based trajectory tracking method [da Silva et al. 2008; Liu et al. 2016, 2012; Muico et al. 2009; Sok et al. 2007].

Reinforcement Learning for Simulated Locomotion: The NeuroAnimator work [Grzeszczuk et al. 1998] stands out as an early demonstration of a form of direct policy gradient method using a recurrent neural network, as applied to 3D swimming movements. Recent years have seen a resurgence of effort towards tackling reinforcement learning problems for simulated agents with continuous state and continuous action spaces. Many of the example problems consist of agents in 2D and 3D physics-based simulations that learn to swim, walk, and hop. Numerous methods have been applied to tasks that include planar biped locomotion or planar swimming: trajectory optimization [Levine and Abbeel 2014; Levine and Koltun 2014; Mordatch and Todorov 2014]; trust region policy optimization (TRPO) [Schulman et al. 2015]; and actor-critic approaches [Lillicrap et al. 2015; Mnih et al. 2016; Peng and van de Panne 2016].

Progress has recently been made on the harder problem of achieving 3D biped locomotion using learned model-free policies without the use of a priori control structures. Tackling this problem de novo, i.e., without any hints as to what walking looks like, is particularly challenging. Recent work using generalized advantage estimation together with TRPO [Schulman et al. 2016] demonstrates sustained 3D locomotion for a biped with ball feet. Another promising approach uses trajectory optimization methods to provide supervision

for a recurrent neural network to generate stepping movements for 3D bipeds and quadrupeds [Mordatch et al. 2015], with the final motion coming from an optimization step rather than directly from a forward dynamics simulation. Hierarchical control of 3D locomotion has recently been proposed [Heess et al. 2016], and is perhaps the closest work to our own. Low-level, high-frequency controllers are first learned during a pretraining phase, in conjunction with a provisional high-level controller that has access to task-relevant information and that can communicate with the low-level controller via a modulatory signal. Once pretraining has been completed, the low-level controller structure is fixed and other high-level controllers can then be trained. The 3D humanoid is trained to navigate a slalom course, an impressive feat given the de novo nature of the locomotion learning. In our work we demonstrate: (a) significantly more natural locomotion and the ability to walk with multiple styles that can be interpolated; (b) locomotion that has significant (quantified) robustness; (c) the learning of controllers for four high-level tasks that use high-dimensional input, i.e., the ability to see the surroundings using ego-centric terrain maps.
