Natural Actor-Critic
Authors: Jan Peters and Stefan Schaal
Neurocomputing, 2008
Cognitive Robotics 2008/2009
Wouter Klijn
Content
- Introduction
- Actor-Critic
- Natural gradient
- Applications
- Conclusion
- References
Actor-Critic
Separate memory structure for policy (Actor) and value function (Critic).
After each action the critic evaluates the new state and returns an error.
The actor and the critic are updated using this error.
The Actor-Critic Architecture [2]
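The loop described above can be sketched in code. This is a minimal tabular sketch of the actor-critic interaction, not the paper's algorithm; the two-state toy environment, learning rates and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal tabular actor-critic sketch (illustrative, not the paper's method).
# Toy environment: two states; taking action 1 leads to state 1, which pays reward.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
prefs = np.zeros((n_states, n_actions))   # actor: action preferences
values = np.zeros(n_states)               # critic: state-value estimates
alpha, beta, gamma = 0.1, 0.1, 0.9        # assumed learning rates and discount

def step(s, a):
    # Hypothetical toy dynamics: the next state equals the chosen action,
    # and state 1 is the rewarding state.
    s_next = a
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

s = 0
for _ in range(500):
    # Actor: softmax policy over the action preferences.
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    s_next, r = step(s, a)
    # Critic evaluates the new state and returns a TD error...
    td_error = r + gamma * values[s_next] - values[s]
    # ...which updates both the critic and the actor.
    values[s] += beta * td_error
    prefs[s, a] += alpha * td_error
    s = s_next

# After learning, the rewarding action should be preferred in state 0.
print(prefs[0, 1] > prefs[0, 0])
```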
Actor-Critic: Notation
The model in the article is loosely based on an MDP:
- Discrete time
- Continuous state set
- Continuous action set
The system:
- Start state: drawn from a start-state distribution
- At any state the actor chooses an action
- The system transitions to a new state
- The system yields a reward after each action
Actor-Critic: Functions
Goal of the 'system' is to find a policy. This goal is reached by optimizing the normalized expected return as a function of the inputs, with its differential.
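The slide's formulas did not survive transcription; a plausible reconstruction, following the notation of [1], is the normalized expected return

```latex
J(\theta) \;=\; \int_{\mathbb{X}} d^{\pi}(x) \int_{\mathbb{U}} \pi_{\theta}(u \mid x)\, r(x,u)\, \mathrm{d}u\, \mathrm{d}x ,
\qquad
d^{\pi}(x) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p(x_t = x)
```

where d^π is the discounted state distribution, presumably the "differential" the slide refers to.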
Problem: The meat and bones of the article get lost in convoluted functions.
Solution: Use a (presumably) known model/system that can be improved using the same method [4].
Actor-Critic: Simplified model
Actor:
- Universal function approximator, e.g. a Multi-Layer Perceptron (MLP)
- Gets the error from the critic
- Gradient descent!
Critic:
- A baseline (based on example data or a constant) times a function containing learned information, combined with the reward
Natural Gradient: Vanilla Gradient Descent
The critic returns an error which, in combination with the function approximator, can be used to create an error function.
The partial derivatives of this error function, the gradient, can now be used to update the internal parameters of the function approximator (and the critic).
Gradient descent [3]
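As a minimal sketch of vanilla gradient descent on an error function (the quadratic error and its target are illustrative assumptions, not from the paper):

```python
import numpy as np

# Vanilla gradient descent on a simple quadratic error function
# E(w) = 0.5 * ||w - w_target||^2  (illustrative target, not from the paper).
w_target = np.array([2.0, -1.0])

def error(w):
    return 0.5 * np.sum((w - w_target) ** 2)

def gradient(w):
    # Partial derivatives of E with respect to each parameter.
    return w - w_target

w = np.zeros(2)
lr = 0.1
for _ in range(200):
    w -= lr * gradient(w)  # step against the gradient

print(np.round(w, 3))  # converges toward w_target
```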
Natural Gradient: Definition
An 'alternative' gradient to update the function approximator.
Definition of the natural gradient: the vanilla gradient premultiplied by the inverse of the Fisher Information Matrix (FIM).
The FIM is a statistical construct that captures how strongly the output distribution reacts to changes in the parameters.
Premultiplying the vanilla gradient by the inverse FIM gives the direction of steepest descent with respect to the distribution rather than the raw parameters [4].
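A minimal numerical illustration of this definition, assuming a 1-D Gaussian policy with fixed standard deviation (the toy numbers are not from the paper):

```python
import numpy as np

# Natural gradient for a 1-D Gaussian policy N(mu, sigma^2) with fixed sigma.
# The Fisher information with respect to mu is 1/sigma^2, so the natural
# gradient is the vanilla gradient premultiplied by the INVERSE FIM.
sigma = 2.0

def fisher_information(sigma):
    # Analytic FIM of N(mu, sigma^2) with respect to mu.
    return np.array([[1.0 / sigma**2]])

vanilla_grad = np.array([0.5])                   # some gradient dJ/dmu
G = fisher_information(sigma)
natural_grad = np.linalg.solve(G, vanilla_grad)  # G^{-1} * vanilla gradient
print(natural_grad)                              # sigma^2 times the vanilla gradient

# The FIM can also be estimated from data as E[(d log p / d mu)^2]:
rng = np.random.default_rng(0)
mu = 0.0
samples = rng.normal(mu, sigma, size=100_000)
score = (samples - mu) / sigma**2                # d/dmu log N(x; mu, sigma^2)
G_est = np.mean(score**2)                        # close to 1/sigma^2
```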
Natural Gradient: Properties
- The natural gradient is a linearly weighted version of the normal ('vanilla') gradient.
- Convergence to a local minimum is guaranteed.
- By choosing a more direct path to the optimal solution, faster convergence is reached and premature convergence is avoided.
- Covariant: independent of the coordinate frame.
- Averages out stochasticity, so smaller data sets suffice for estimating the gradient correctly.
Gradient landscape for the ‘vanilla’ and natural gradient. Adapted from [1]
Natural Gradient: Plateaus
The natural gradient is a solution to escape from plateaus in the gradient landscape.
Plateaus are regions where the gradients of a function are extremely small. It takes considerable time to traverse them: a well-known 'feature' of gradient descent methods.
Example function landscape showing multiple plateaus and the resulting error while traversing it with normal gradient steps (iterations) [5]
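The plateau effect can be reproduced with a single saturating sigmoid unit; this toy illustration (not the MLP landscape from [5]) shows why gradient steps nearly stall there:

```python
import math

# Plateau illustration: the gradient of a sigmoid unit vanishes when the
# unit saturates, so gradient descent barely moves in that region.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25: healthy gradient near the origin
print(sigmoid_grad(10.0))  # ~4.5e-5: a plateau, progress nearly stalls
```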
Applications: Cart-Pole Balancing
- Well-known benchmark for reinforcement learning [1]
- Unstable non-linear system that can be simulated
- Continuous state and action; reward based on the current state, with a constant baseline (episodic Actor-Critic)
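A sketch of what such a state-based reward with a constant baseline could look like; the standard cart-pole state layout and the cost weights are assumptions, since the slide's symbols were lost in transcription:

```python
import numpy as np

# Hypothetical cart-pole reward with a constant baseline; the state layout
# (x, x_dot, theta, theta_dot) and the quadratic cost weights are assumptions,
# not values taken from the paper.
def reward(state, baseline=1.0):
    x, x_dot, theta, theta_dot = state  # cart position/velocity, pole angle/velocity
    # Penalize deviation from the upright pole and from the track center.
    return baseline - (theta**2 + 0.1 * x**2)

upright = np.array([0.0, 0.0, 0.0, 0.0])
tilted = np.array([0.5, 0.0, 0.3, 0.0])
print(reward(upright), reward(tilted))
```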
Applications: Cart-Pole Balancing
Simulated experiment with a sample rate of 60 Hz, comparing natural- and vanilla-gradient Actor-Critic algorithms.
Results: The natural gradient implementation takes on average ten minutes to find an optimal solution; the vanilla gradient takes on average two hours.
Expected return / policy error, averaged over 100 simulated runs
Applications: Baseball
Optimizing nonlinear dynamic motor primitives for robotics. In plain English: teaching a robot to hit a ball.
Shows the usage of a baseline for the critic: a teacher manipulating the robot. (LSTD-Q(λ) Actor-Critic)
State, action and reward are not explicitly given, but are based on the motor primitives (and presumably a camera input).
Optimal (red), POMDP (dashed) and Actor-Critic motor primitives.
Applications: Baseball
The task of the robot is to hit the ball so that it flies as far as possible. The robot has seven degrees of freedom.
Initially the robot is taught by supervised learning and fails. Subsequently the performance is improved by the Natural Actor-Critic.
Applications: Baseball
Both learning methods eventually learn their version of the best solution. However, the POMDP approach requires 10^6 learning steps, compared to 10^3 for the Natural Actor-Critic.
Remarkably, the Natural Actor-Critic subjectively finds a solution that is closer to the teacher's/optimal solution.
Conclusions
A novel policy-gradient reinforcement learning method, in two distinct flavors:
- Episodic, with a constant as the baseline function in the critic
- LSTD-Q(λ), with a rich baseline (teacher) function
The improved performance can be traced back to the use of the natural gradient, which exploits statistical information about the input data to optimize the changes made to the learned functions.
Conclusions
Preliminary versions of the method have been implemented in a wide range of real-world applications:
- Humanoid robots
- Traffic-light optimization
- Multi-robot systems
- Gait optimization in robot locomotion
References
[1] J. Peters and S. Schaal, "Natural Actor-Critic", Neurocomputing, 2008.
[2] R.S. Sutton and A.G. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, 1998. Web version: http://www.cs.ualberta.ca/~sutton/book/the-book.html
[3] http://en.wikipedia.org/wiki/Gradient_descent
[4] S. Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 10, 251–276, 1998.
[5] K. Fukumizu and S. Amari, "Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons", Neural Networks, 2000.