CS 760, Fall 2011
Reinforcement Learning Approaches for Locomotion Planning in Interactive Applications
Tomislav Pejsa and Sean Andrist
Abstract

Locomotion is one of the most important capabilities for virtual human agents in interactive applications, because it allows them to navigate their environment. Locomotion controllers in interactive applications typically work by blending and concatenating clips of keyframe or motion-capture motion that represent individual locomotion actions (e.g. walk cycles), to generate sequences of natural-looking, task-appropriate character motion. The key challenge of locomotion implementation is planning, i.e. choosing an optimal sequence of locomotion actions that achieves a high-level navigation goal. In recent years researchers have successfully applied reinforcement learning to this problem. In this paper we give an overview of these efforts, and demonstrate our own application of reinforcement learning to a simple navigation task.
1. Introduction

The term locomotion refers to AI- or user-directed movement of a virtual character to a target location while avoiding both static and dynamic obstacles along the way. Locomotion in composite, dynamically changing environments is a fundamental challenge for virtual human characters in interactive applications. A potential solution is to employ motion planning, a set of techniques originating from robotics that uses smart control policies for synthesis of long motion sequences which accurately and efficiently accomplish complex high-level tasks, e.g., reaching the target destination while avoiding all obstacles. The main difficulty arises from the fact that the space of states and actions to search is huge; the character can move in a potentially infinite number of directions in a potentially infinite number of ways.
Characters are assumed to have a motion corpus of available motions they can use to move about their environment. This motion corpus can consist of either example-based motions from a motion capture system, or clips of keyframe animation. In the work we present in this paper, our characters' motion corpus consists of the latter. Unlike techniques that search for appropriate motion clips while using only local information, motion planning methods consider the entire relevant state space and generate motion sequences that are close to being globally optimal; that is, they are near-guaranteed to achieve objectives in the best, most expedient manner possible. As performing global motion search at run-time is infeasible, it is necessary to perform as much of the planning computation as possible in a preprocessing step and then use the precomputed data at run-time to make near-optimal local decisions. Efficient and near-optimal planners have been previously developed which employ control policies estimated through the iterative process of reinforcement learning.
In the next section we present background on the general area of motion synthesis for computer animation, especially motion planning in a data-driven framework. In Section 3 we detail how reinforcement learning is an effective approach to motion planning for locomotion. We also discuss our own implementation of locomotion planning using reinforcement learning for a character navigating to a target in an environment with numerous obstacles, which can be seen in Figure 1.
Figure 1: Our character navigates an environment filled with obstacles to reach a specific target (the white pole).
2. Motion Synthesis for Computer Animation
There are a number of possible approaches to motion synthesis for character animation. Procedural and physics-based animation methods employ a mathematical model to generate desired character motion according to high-level parameters. Because of its parametric nature, this kind of motion is intuitively controllable and suitable for interactive applications. However, it suffers greatly from a lack of naturalness. The underlying mathematical model is typically only a rough approximation of the actual physics that governs the movement of living creatures, while more complex models are too computationally expensive to be of any use in interactive applications. For this reason procedural and physics-based methods are still typically used only in a few specific situations. Instead, it is quite common to use data-driven methods, which rely on having a database of recorded or hand-crafted motion clips with which to synthesize new natural-looking motions.
2.1. Data-driven motion synthesis
The fundamental idea behind example-based or data-driven motion synthesis is to combine the controllability of procedural animation with the realistic appearance of recorded clips of motion. This is accomplished with two techniques: motion concatenation and parametric motion synthesis. The former refers to the concatenation of short motion segments into sequences of actions. This is commonly done with motion graphs [KGP02], which are graph structures with motion clips at the nodes and transitions between clips as edges. These edges must be precomputed between points where the character poses are similar enough to make the transition between clips smooth and natural. The second technique, parametric motion synthesis, enables parametrically controlled interpolation of similar motion clips which correspond to the same logical action. Inverse kinematics (IK) techniques are used to enforce constraints, such as footplants to reduce foot sliding.
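As a minimal illustration of motion concatenation, a motion graph can be stored as an adjacency structure over clips, where an edge means a smooth transition between two clips has been precomputed. The clip names and transitions below are hypothetical examples, not data from our system:

```python
# Sketch of a motion graph: nodes are motion clips, edges are precomputed
# transitions between clips whose boundary poses are similar enough.
# Clip names and edges here are illustrative assumptions.
motion_graph = {
    "walk_cycle": ["walk_cycle", "turn_left", "turn_right", "stop"],
    "turn_left":  ["walk_cycle"],
    "turn_right": ["walk_cycle"],
    "stop":       ["idle"],
    "idle":       ["walk_cycle"],
}

def valid_sequences(graph, start, length):
    """Enumerate all clip sequences of the given length reachable from `start`
    by following transition edges."""
    if length == 1:
        return [[start]]
    return [[start] + rest
            for nxt in graph[start]
            for rest in valid_sequences(graph, nxt, length - 1)]

seqs = valid_sequences(motion_graph, "idle", 3)
```

A planner then chooses among such valid sequences; the point of the graph is that any enumerated sequence is guaranteed to play back smoothly.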
2.2. Motion planning
Motion graphs and parametric motion are only useful if they can be used by high-level application modules for synthesis of motion sequences that accomplish various goals at run-time. Translating these high-level goals into synthesis of low-level motion sequences is the principal function of the character controller. The character controller is mainly responsible for locomotion, as well as both environment and character interaction. The set of techniques used for solving the locomotion problem in a globally optimal way is collectively referred to as motion planning.
Motion planning for locomotion is a complex problem for online motion synthesis. In most practical implementations, example-based motion synthesis is still done by performing a search of the character's motion corpus with local information. Such methods do not at all guarantee achievement of objectives in an optimal manner and may fail altogether in more complex scenarios. These problems are illustrated in Figure 2, which depicts several characters failing to walk through an open gate. Similarly, in our implementation, presented later, we show how a greedy search strategy almost always fails.
A number of more sophisticated approaches havebeen proposed. One of these is probabilistic roadmaps
Figure 2: Using only local search on motion graphs for motion synthesis, several characters fail to navigate through the gate without bumping into it [LZ08].
(PRMs) [CLS03]. The environment is sampled for valid character configurations, e.g., with footprints, which are then connected using motion clips in the motion corpus, resulting in a traversable graph of character figure configurations covering the accessible environment. Motion synthesis is done by computing a minimum-cost path through the roadmap that reaches the target location. Another approach uses precomputed search trees of states for the synthesis of long motion sequences [LK06]. Most of these global approaches suffer either from high computational complexity or from too many approximating heuristics to work well.
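The minimum-cost path search used by PRM-style planners can be sketched with Dijkstra's algorithm over the roadmap. In a real planner the nodes are sampled character configurations and edge costs come from the motion clips connecting them; the node names and costs below are hypothetical:

```python
import heapq

def shortest_path(roadmap, start, goal):
    """Dijkstra's algorithm over a roadmap {node: [(neighbor, cost), ...]}.
    Returns (total cost, path) or (inf, []) if the goal is unreachable."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, c in roadmap.get(node, []):
            if nbr not in visited:
                heapq.heappush(frontier, (cost + c, nbr, path + [nbr]))
    return float("inf"), []

# Tiny hypothetical roadmap: going A-B-C-D (cost 3.0) beats the direct A-C edge.
roadmap = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
}
cost, path = shortest_path(roadmap, "A", "D")
```

The returned path is then realized by playing back the motion clips attached to each traversed edge.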
Researchers in recent years have embraced reinforcement learning as an effective technique for locomotion planning. Rather than precompute search results, reinforcement learning precomputes a control policy that makes near-optimal decisions.
3. Reinforcement Learning for Locomotion
In reinforcement learning, objectives to accomplish are formulated as reward functions, which compute rewards for being in a particular state and performing specific actions (motion clips). Agents learn from an indirect, delayed reward, and choose sequences of actions that produce the greatest cumulative reward. A control policy is precomputed and used at run-time to pick actions which maximize the long-term reward, thus ensuring synthesis of an optimal motion sequence. For our implementation, we use a common reinforcement learning algorithm called Q learning, which can produce optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment.
The simplest approach to Q learning, which we use, involves the construction of a Q table which lists all possible state-action pairs. A valid concern is that the Q table becomes much too large, possibly even infinite. This can be alleviated by using different representations, such as scattered data interpolation [IAF05], linear combinations of basis functions [TLP07], and regression trees [LZ08]. In the end, at any state s the optimal policy π will compute the best action to perform. This is done with the following equation:
π(s) = arg max_a Q(s, a)    (1)
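With a tabular Q function, Equation 1 is just an argmax over the actions available in a state. A minimal sketch, where the state and action names are hypothetical placeholders rather than our actual locomotion actions:

```python
# Greedy policy extraction from a Q table (Equation 1): in state s, pick the
# action with the highest learned Q value. Entries below are made up.
Q = {
    ("cell_0", "walk_forward"): 0.81,
    ("cell_0", "turn_left"):    0.72,
    ("cell_0", "turn_right"):   0.65,
}

def policy(q_table, state):
    """Return arg max over actions a of Q(state, a)."""
    actions = {a: v for (s, a), v in q_table.items() if s == state}
    return max(actions, key=actions.get)

best = policy(Q, "cell_0")
```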
We build up our Q table by iteratively refining its values until we have converged to the final values. We do this by exploring the state-action space and updating Q values based on the rewards, r, received.
Q(s, a) = r(s, a) + γ max_a' Q(s', a')    (2)

where s' is the state reached by performing action a in state s, and γ is the discount factor.
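On a small deterministic world with a known transition function, the iterative update of Equation 2 reduces to repeated sweeps over all state-action pairs. The one-dimensional grid, reward, and discount factor below are our own illustrative choices, not the setup of the paper:

```python
# Sketch of tabular Q-value iteration (Equation 2) on a tiny 1-D grid.
# The world, reward, and discount factor are illustrative assumptions.
GAMMA = 0.9
STATES = [0, 1, 2, 3]      # cell 3 is the goal
ACTIONS = [-1, +1]         # step left / step right

def step(s, a):
    """Deterministic transition: move and clamp to the grid."""
    return min(max(s + a, 0), 3)

def reward(s, a):
    """Reward 1 for stepping onto the goal cell, 0 otherwise."""
    return 1.0 if step(s, a) == 3 else 0.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):       # sweep until the values converge
    for s in STATES:
        for a in ACTIONS:
            s2 = step(s, a)
            Q[(s, a)] = reward(s, a) + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
```

After convergence, the greedy policy of Equation 1 steps right toward the goal from every cell, even though only the final step is directly rewarded; the discount factor propagates that delayed reward backward through the table.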
Specifics on how to formulate the locomotion planning problem in a reinforcement learning framework are provided in our implementation section.
3.1. Motion Synthesis
Once an optimal policy has been estimated, motionsynthesis is s