CS 760, Fall 2011
Reinforcement Learning Approaches for Locomotion Planning in Interactive Applications
Tomislav Pejsa and Sean Andrist
Abstract

Locomotion is one of the most important capabilities for virtual human agents in interactive applications, because it allows them to navigate their environment. Locomotion controllers in interactive applications typically work by blending and concatenating clips of keyframe or motion-capture motion that represent individual locomotion actions (e.g. walk cycles), to generate sequences of natural-looking, task-appropriate character motion. The key challenge of locomotion implementation is planning, i.e. choosing an optimal sequence of locomotion actions that achieves a high-level navigation goal. In recent years researchers have successfully applied reinforcement learning to this problem. In this paper we give an overview of these efforts, and demonstrate our own application of reinforcement learning to a simple navigation task.
1. Introduction

The term locomotion refers to AI- or user-directed movement of a virtual character to a target location while avoiding both static and dynamic obstacles along the way. Locomotion in composite, dynamically changing environments is a fundamental challenge for virtual human characters in interactive applications. A potential solution is to employ motion planning, a set of techniques originating from robotics that uses smart control policies for synthesis of long motion sequences which accurately and efficiently accomplish complex high-level tasks, e.g., reaching the target destination while avoiding all obstacles. The main difficulty arises from the fact that the space of states and actions to search is huge; the character can move in a potentially infinite number of directions in a potentially infinite number of ways.
Characters are assumed to have a motion corpus of available motions they can use to move about their environment. This motion corpus can consist of either example-based motions from a motion capture system, or clips of keyframe animation. In the work we present in this paper, our characters' motion corpus consists of the latter. Unlike techniques that search for appropriate motion clips while using only local information, motion planning methods consider the entire relevant state space and generate motion sequences that are close to being globally optimal; that is, they are near-guaranteed to achieve objectives in the best, most expedient manner possible. As performing global motion search at run-time is infeasible, it is necessary to perform as much of the planning computation as possible in a preprocessing step and then use the precomputed data at run-time to make near-optimal local decisions. Efficient and near-optimal planners have been previously developed which employ control policies estimated through the iterative process of reinforcement learning.
In the next section we present background on the general area of motion synthesis for computer animation, especially motion planning in a data-driven framework. In Section 3 we detail how reinforcement learning is an effective approach to motion planning for locomotion. We also discuss our own implementation of locomotion planning using reinforcement learning for a character navigating to a target in an environment with numerous obstacles, which can be seen in Figure 1.
Figure 1: Our character navigates an environment filled with obstacles to reach a specific target (the white pole).
2. Motion Synthesis for Computer Animation
There are a number of possible approaches to motion synthesis for character animation. Procedural and physics-based animation methods employ a mathematical model to generate desired character motion according to high-level parameters. Because of its parametric nature, this kind of motion is intuitively controllable and suitable for interactive applications. However, it suffers greatly from a lack of naturalness. The underlying mathematical model is typically only a rough approximation of the actual physics that governs the movement of living creatures, while more complex models are too computationally expensive to be of any use in interactive applications. For this reason procedural and physics-based methods are still typically used only in a few specific situations. Instead, it is quite common to use data-driven methods, which rely on having a database of recorded or hand-crafted motion clips with which to synthesize new natural-looking motions.
2.1. Data-driven motion synthesis
The fundamental idea behind example-based or data-driven motion synthesis is to combine the controllability of procedural animation with the realistic appearance of recorded clips of motion. This is accomplished with two techniques: motion concatenation and parametric motion synthesis. The former refers to the concatenation of short motion segments into sequences of actions. This is commonly done with motion graphs [KGP02], which are graph structures with motion clips at the nodes and transitions between clips as edges. These edges must be precomputed between points where the character poses are similar enough to make the transition between clips smooth and natural. The second technique, parametric motion synthesis, enables parametrically controlled interpolation of similar motion clips which correspond to the same logical action. Inverse kinematics (IK) techniques are used to enforce constraints, such as footplants to reduce foot sliding.
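As a minimal illustration of motion concatenation, a motion graph can be stored as an adjacency structure over clips, where an edge means a smooth transition between two clips has been precomputed. The clip names and transitions below are hypothetical examples, not data from our system:

```python
# Sketch of a motion graph: nodes are motion clips, edges are precomputed
# transitions between clips whose boundary poses are similar enough.
# Clip names and edges here are illustrative assumptions.
motion_graph = {
    "walk_cycle": ["walk_cycle", "turn_left", "turn_right", "stop"],
    "turn_left":  ["walk_cycle"],
    "turn_right": ["walk_cycle"],
    "stop":       ["idle"],
    "idle":       ["walk_cycle"],
}

def valid_sequences(graph, start, length):
    """Enumerate all clip sequences of the given length reachable from `start`
    by following transition edges."""
    if length == 1:
        return [[start]]
    return [[start] + rest
            for nxt in graph[start]
            for rest in valid_sequences(graph, nxt, length - 1)]

seqs = valid_sequences(motion_graph, "idle", 3)
```

A planner then chooses among such valid sequences; the point of the graph is that any enumerated sequence is guaranteed to play back smoothly.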
2.2. Motion planning
Motion graphs and parametric motion are only useful if they can be used by high-level application modules for synthesis of motion sequences that accomplish various goals at run-time. Translating these high-level goals into synthesis of low-level motion sequences is the principal function of the character controller. The character controller is mainly responsible for locomotion, as well as both environment and character interaction. The set of techniques used for solving the locomotion problem in a globally optimal way is collectively referred to as motion planning.
Motion planning for locomotion is a complex problem for online motion synthesis. In most practical implementations, example-based motion synthesis is still done by performing a search of the character's motion corpus with local information. Such methods do not at all guarantee achievement of objectives in an optimal manner and may fail altogether in more complex scenarios. These problems are illustrated in Figure 2, which depicts several characters failing to walk through an open gate. Similarly, in our implementation, presented later, we show how a greedy search strategy almost always fails.
A number of more sophisticated approaches havebeen proposed. One of these is probabilistic roadmaps
Figure 2: Using only local search on motion graphs for motion synthesis, several characters fail to navigate through the gate without bumping into it [LZ08].
(PRMs) [CLS03]. The environment is sampled for valid character configurations, e.g., with footprints, which are then connected using motion clips in the motion corpus, resulting in a traversable graph of character figure configurations covering the accessible environment. Motion synthesis is done by computing a minimum-cost path through the roadmap that reaches the target location. Another approach uses precomputed search trees of states for the synthesis of long motion sequences [LK06]. Most of these global approaches suffer either from high computational complexity or from too many approximating heuristics to work well.
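The minimum-cost path search used by PRM-style planners can be sketched with Dijkstra's algorithm over the roadmap. In a real planner the nodes are sampled character configurations and edge costs come from the motion clips connecting them; the node names and costs below are hypothetical:

```python
import heapq

def shortest_path(roadmap, start, goal):
    """Dijkstra's algorithm over a roadmap {node: [(neighbor, cost), ...]}.
    Returns (total cost, path) or (inf, []) if the goal is unreachable."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, c in roadmap.get(node, []):
            if nbr not in visited:
                heapq.heappush(frontier, (cost + c, nbr, path + [nbr]))
    return float("inf"), []

# Tiny hypothetical roadmap: going A-B-C-D (cost 3.0) beats the direct A-C edge.
roadmap = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
}
cost, path = shortest_path(roadmap, "A", "D")
```

The returned path is then realized by playing back the motion clips attached to each traversed edge.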
Researchers in recent years have embraced reinforcement learning as an effective technique for locomotion planning. Rather than precompute search results, reinforcement learning precomputes a control policy that makes near-optimal decisions.
3. Reinforcement Learning for Locomotion
In reinforcement learning, objectives to accomplish are formulated as reward functions, which compute rewards for being in a particular state and performing specific actions (motion clips). Agents learn from an indirect, delayed reward, and choose sequences of actions that produce the greatest cumulative reward. A control policy is precomputed and used at run-time to pick actions which maximize the long-term reward, thus ensuring synthesis of an optimal motion sequence. For our implementation, we use a common reinforcement learning algorithm called Q learning, which can produce optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment.
The simplest approach to Q learning, which we use, involves the construction of a Q table which lists all possible state-action pairs. A valid concern is that the Q table becomes much too large, possibly even infinite. This can be alleviated by using different representations, such as scattered data interpolation [IAF05], linear combinations of basis functions [TLP07], and regression trees [LZ08]. In the end, at any state s the optimal policy π will compute the best action to perform. This is done with the following equation:
π(s) = arg max_a Q(s, a)    (1)
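With a tabular Q function, Equation 1 is just an argmax over the actions available in a state. A minimal sketch, where the state and action names are hypothetical placeholders rather than our actual locomotion actions:

```python
# Greedy policy extraction from a Q table (Equation 1): in state s, pick the
# action with the highest learned Q value. Entries below are made up.
Q = {
    ("cell_0", "walk_forward"): 0.81,
    ("cell_0", "turn_left"):    0.72,
    ("cell_0", "turn_right"):   0.65,
}

def policy(q_table, state):
    """Return arg max over actions a of Q(state, a)."""
    actions = {a: v for (s, a), v in q_table.items() if s == state}
    return max(actions, key=actions.get)

best = policy(Q, "cell_0")
```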
We build up our Q table by iteratively refining its values until we have converged to the final values. We do this by exploring the state-action space and updating Q values based on the rewards, r, received.
Q(s, a) = r(s, a) + γ max_a' Q(s', a')    (2)

where s' is the state reached by performing action a in state s, and γ is the discount factor.
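On a small deterministic world with a known transition function, the iterative update of Equation 2 reduces to repeated sweeps over all state-action pairs. The one-dimensional grid, reward, and discount factor below are our own illustrative choices, not the setup of the paper:

```python
# Sketch of tabular Q-value iteration (Equation 2) on a tiny 1-D grid.
# The world, reward, and discount factor are illustrative assumptions.
GAMMA = 0.9
STATES = [0, 1, 2, 3]      # cell 3 is the goal
ACTIONS = [-1, +1]         # step left / step right

def step(s, a):
    """Deterministic transition: move and clamp to the grid."""
    return min(max(s + a, 0), 3)

def reward(s, a):
    """Reward 1 for stepping onto the goal cell, 0 otherwise."""
    return 1.0 if step(s, a) == 3 else 0.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):       # sweep until the values converge
    for s in STATES:
        for a in ACTIONS:
            s2 = step(s, a)
            Q[(s, a)] = reward(s, a) + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
```

After convergence, the greedy policy of Equation 1 steps right toward the goal from every cell, even though only the final step is directly rewarded; the discount factor propagates that delayed reward backward through the table.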
Specifics on how to formulate the locomotion planning problem in a reinforcement learning framework are provided in our implementation section.
3.1. Motion Synthesis
Once an optimal policy has been estimated, motionsynthesis is s