Natural Actor-Critic
Authors: Jan Peters and Stefan Schaal
Neurocomputing, 2008
Cognitive Robotics 2008/2009
Wouter Klijn
Content
- Introduction
- Actor-Critic
- Natural gradient
- Applications
- Conclusion
- References
Actor-Critic
Separate memory structure for policy (Actor) and value function (Critic).
After each action the critic evaluates the new state and returns an error.
The actor and the critic are updated using this error.
The Actor-Critic Architecture [2]
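The loop described above can be sketched in code. This is a minimal tabular sketch of the actor-critic interaction, not the paper's algorithm; the two-state toy environment, learning rates and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal tabular actor-critic sketch (illustrative, not the paper's method).
# Toy environment: two states; taking action 1 leads to state 1, which pays reward.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
prefs = np.zeros((n_states, n_actions))   # actor: action preferences
values = np.zeros(n_states)               # critic: state-value estimates
alpha, beta, gamma = 0.1, 0.1, 0.9        # assumed learning rates and discount

def step(s, a):
    # Hypothetical toy dynamics: the next state equals the chosen action,
    # and state 1 is the rewarding state.
    s_next = a
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

s = 0
for _ in range(500):
    # Actor: softmax policy over the action preferences.
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    s_next, r = step(s, a)
    # Critic evaluates the new state and returns a TD error...
    td_error = r + gamma * values[s_next] - values[s]
    # ...which updates both the critic and the actor.
    values[s] += beta * td_error
    prefs[s, a] += alpha * td_error
    s = s_next

# After learning, the rewarding action should be preferred in state 0.
print(prefs[0, 1] > prefs[0, 0])
```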
Actor-Critic: Notation
The model in the article is loosely based on an MDP:
- Discrete time
- Continuous state set
- Continuous action set
The system:
- Start state: drawn from a start-state distribution
- At any state the actor chooses an action
- The system transitions to a new state
- The system yields a reward after each action
Actor-Critic: Functions
Goal of the 'system' is to find a policy. This goal is reached by optimizing the normalized expected return as a function of the inputs, with its differential.
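The slide's formulas did not survive transcription; a plausible reconstruction, following the notation of [1], is the normalized expected return

```latex
J(\theta) \;=\; \int_{\mathbb{X}} d^{\pi}(x) \int_{\mathbb{U}} \pi_{\theta}(u \mid x)\, r(x,u)\, \mathrm{d}u\, \mathrm{d}x ,
\qquad
d^{\pi}(x) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p(x_t = x)
```

where d^π is the discounted state distribution, presumably the "differential" the slide refers to.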
Problem: The meat and bones of the article get lost in convoluted functions.
Solution: Use a (presumably) known model/system that can be improved using the same method [4].
Actor-Critic: Simplified model
Actor:
- Universal function approximator, e.g. a Multi-Layer Perceptron (MLP)
- Gets the error from the critic
- Gradient descent!
Critic:
- A baseline (based on example data or a constant) times a function containing learned information, combined with the reward
Natural Gradient: Vanilla Gradient Descent
The critic returns an error which, in combination with the function approximator, can be used to create an error function.
The partial derivatives of this error function, the gradient, can now be used to update the internal parameters of the function approximator (and the critic).
Gradient descent [3]
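As a minimal sketch of vanilla gradient descent on an error function (the quadratic error and its target are illustrative assumptions, not from the paper):

```python
import numpy as np

# Vanilla gradient descent on a simple quadratic error function
# E(w) = 0.5 * ||w - w_target||^2  (illustrative target, not from the paper).
w_target = np.array([2.0, -1.0])

def error(w):
    return 0.5 * np.sum((w - w_target) ** 2)

def gradient(w):
    # Partial derivatives of E with respect to each parameter.
    return w - w_target

w = np.zeros(2)
lr = 0.1
for _ in range(200):
    w -= lr * gradient(w)  # step against the gradient

print(np.round(w, 3))  # converges toward w_target
```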
Natural Gradient: Definition
An 'alternative' gradient to update the function approximator.
Definition of the natural gradient: the vanilla gradient premultiplied by the inverse of the Fisher Information Matrix (FIM).
The FIM is a statistical construct that captures how strongly the output distribution reacts to changes in the parameters.
Premultiplying the vanilla gradient by the inverse FIM gives the direction of steepest descent with respect to the distribution rather than the raw parameters [4].
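A minimal numerical illustration of this definition, assuming a 1-D Gaussian policy with fixed standard deviation (the toy numbers are not from the paper):

```python
import numpy as np

# Natural gradient for a 1-D Gaussian policy N(mu, sigma^2) with fixed sigma.
# The Fisher information with respect to mu is 1/sigma^2, so the natural
# gradient is the vanilla gradient premultiplied by the INVERSE FIM.
sigma = 2.0

def fisher_information(sigma):
    # Analytic FIM of N(mu, sigma^2) with respect to mu.
    return np.array([[1.0 / sigma**2]])

vanilla_grad = np.array([0.5])                   # some gradient dJ/dmu
G = fisher_information(sigma)
natural_grad = np.linalg.solve(G, vanilla_grad)  # G^{-1} * vanilla gradient
print(natural_grad)                              # sigma^2 times the vanilla gradient

# The FIM can also be estimated from data as E[(d log p / d mu)^2]:
rng = np.random.default_rng(0)
mu = 0.0
samples = rng.normal(mu, sigma, size=100_000)
score = (samples - mu) / sigma**2                # d/dmu log N(x; mu, sigma^2)
G_est = np.mean(score**2)                        # close to 1/sigma^2
```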
Natural Gradient: Properties
- The natural gradient is a linearly weighted version of the normal ('vanilla') gradient.
- Convergence to a local minimum is guaranteed.
- By choosing a more direct path to the optimal solution, faster convergence is reached and premature convergence is avoided.
- Covariant: independent of the coordinate frame.
- Averages out stochasticity, so smaller data sets suffice for estimating the gradient correctly.
Gradient landscape for the ‘vanilla’ and natural gradient. Adapted from [1]
Natural Gradient: Plateaus
The natural gradient is a solution to escape from plateaus in the gradient landscape.
Plateaus are regions where the gradients of a function are extremely small. It takes considerable time to traverse them: a well-known 'feature' of gradient descent methods.
Example function landscape showing multiple plateaus and the resulting error while traversing it with normal gradient steps (iterations) [5]
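The plateau effect can be reproduced with a single saturating sigmoid unit; this toy illustration (not the MLP landscape from [5]) shows why gradient steps nearly stall there:

```python
import math

# Plateau illustration: the gradient of a sigmoid unit vanishes when the
# unit saturates, so gradient descent barely moves in that region.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25: healthy gradient near the origin
print(sigmoid_grad(10.0))  # ~4.5e-5: a plateau, progress nearly stalls
```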
Applications: Cart-Pole Balancing
- Well-known benchmark for reinforcement learning [1]
- Unstable non-linear system that can be simulated
- Continuous state and action; reward based on the current state, with a constant baseline (episodic Actor-Critic)
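A sketch of what such a state-based reward with a constant baseline could look like; the standard cart-pole state layout and the cost weights are assumptions, since the slide's symbols were lost in transcription:

```python
import numpy as np

# Hypothetical cart-pole reward with a constant baseline; the state layout
# (x, x_dot, theta, theta_dot) and the quadratic cost weights are assumptions,
# not values taken from the paper.
def reward(state, baseline=1.0):
    x, x_dot, theta, theta_dot = state  # cart position/velocity, pole angle/velocity
    # Penalize deviation from the upright pole and from the track center.
    return baseline - (theta**2 + 0.1 * x**2)

upright = np.array([0.0, 0.0, 0.0, 0.0])
tilted = np.array([0.5, 0.0, 0.3, 0.0])
print(reward(upright), reward(tilted))
```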
Applications: Cart-Pole Balancing
Simulated experiment with a sample rate of 60 Hz, comparing natural- and vanilla-gradient Actor-Critic algorithms.
Results: The natural gradient implementation takes on average ten minutes to find an optimal solution; the vanilla gradient takes on average two hours.
Expected return / policy error, averaged over 100 simulated runs
Applications: Baseball
Optimizing nonlinear dynamic motor primitives for robotics. In plain English: teaching a robot to hit a ball.
Shows the usage of a baseline for the critic: a teacher manipulating the robot. (LSTD-Q(λ) Actor-Critic)
State, action and reward are not explicitly given, but are based on the motor primitives (and presumably a camera input).
Optimal (red), POMDP (dashed) and Actor-Critic motor primitives.
Applications: Baseball
The task of the robot is to hit the ball so that it flies as far as possible. The robot has seven degrees of freedom.
Initially the robot is taught by supervised learning and fails. Subsequently the performance is improved by the Natural Actor-Critic.
Applications: Baseball
Both learning methods eventually learn their version of the best solution. However, the POMDP approach requires 10^6 learning steps, compared to 10^3 for the Natural Actor-Critic.
Remarkably, the Natural Actor-Critic subjectively finds a solution that is closer to the teacher's/optimal solution.
Conclusions
A novel policy-gradient reinforcement learning method, in two distinct flavors:
- Episodic, with a constant as the baseline function in the critic
- LSTD-Q(λ), with a rich baseline (teacher) function
The improved performance can be traced back to the use of the natural gradient, which exploits statistical information about the input data to optimize the changes made to the learned functions.
Conclusions
Preliminary versions of the method have been implemented in a wide range of real-world applications:
- Humanoid robots
- Traffic-light optimization
- Multi-robot systems
- Gait optimization in robot locomotion
References
[1] J. Peters and S. Schaal, "Natural Actor-Critic", Neurocomputing, 2008.
[2] R.S. Sutton and A.G. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, 1998. Web version: http://www.cs.ualberta.ca/~sutton/book/the-book.html
[3] http://en.wikipedia.org/wiki/Gradient_descent
[4] S. Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 10, 251–276, 1998.
[5] K. Fukumizu and S. Amari, "Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons", Neural Networks, 2000.