Policy Gradient in Continuous Time
Presented by Hui Li
Duke University Machine Learning Group
May 30, 2007
Based on the paper by Rémi Munos, JMLR 2006
Outline
• Introduction
• Discretized Stochastic Processes Approximation
• Model-free Reinforcement Learning (RL) algorithm
• Example Results
Introduction of the Problem
• Consider an optimal control problem with a continuous state
• System dynamics: $\frac{dx_t}{dt} = f(x_t, u_t)$, where $u_t$ is the control and $x_t$ the state
• The process is deterministic and the state space is continuous
• Objective: find an optimal control $(u_t)$ that maximizes the objective function
  $J(x_0; (u_t)) = r(x_T)$, a terminal reward at time $T$
• Consider a class of parameterized policies with $u_t = \pi_\theta(t, x_t)$
• Find the parameter $\theta$ that maximizes the performance measure
  $V(\theta) = J(x_0; u_t = \pi_\theta(t, x_t))$
• The standard approach is gradient ascent: $\theta \leftarrow \theta + \eta \nabla_\theta V(\theta)$
  Computing $\nabla_\theta V(\theta)$ is the object of the paper.
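As a rough illustration, the outer loop is plain gradient ascent. A minimal Python sketch, assuming a hypothetical `estimate_gradient(theta)` that stands in for whichever estimator of $\nabla_\theta V(\theta)$ the talk develops:

```python
import numpy as np

def gradient_ascent(theta, estimate_gradient, eta=0.01, n_iters=100):
    """Ascend V(theta): theta <- theta + eta * estimated gradient."""
    theta = np.array(theta, dtype=float)
    for _ in range(n_iters):
        theta = theta + eta * estimate_gradient(theta)
    return theta
```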
Introduction of the Problem
How to compute $\nabla_\theta V(\theta)$?
• Finite-difference method (a sketch follows this list):
  $\partial_{\theta_i} V(\theta) \approx \frac{V(\theta + \epsilon e_i) - V(\theta)}{\epsilon}$
  This method requires a large number of trajectories (at least one per parameter component) to compute the gradient of the performance measure.
• Pathwise estimation of the gradient: compute the gradient using one trajectory only.
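A minimal sketch of the finite-difference estimator, assuming a hypothetical `rollout_value(theta)` that simulates one trajectory under $\pi_\theta$ and returns $r(x_T)$; the cost in trajectories is visible in the loop:

```python
import numpy as np

def finite_difference_gradient(rollout_value, theta, eps=1e-4):
    """Estimate grad V(theta) with 1 + len(theta) rollouts."""
    v0 = rollout_value(theta)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_i = theta.copy()
        theta_i[i] += eps
        grad[i] = (rollout_value(theta_i) - v0) / eps  # one extra rollout per parameter
    return grad
```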
Pathwise Estimation of the Gradient
• Define $z_t = \nabla_\theta x_t$, the sensitivity of the state to the policy parameter
• Dynamics of $z_t$:
  $\frac{dz_t}{dt} = \nabla_\theta f(x_t) + \nabla_x f(x_t)\, z_t$
  (writing $f(x_t)$ as shorthand for $f(x_t, \pi_\theta(t, x_t))$)
• Gradient:
  $\nabla_\theta V(\theta) = \nabla_\theta r(x_T) = \nabla_x r(x_T)\, z_T$
• The reward $r$ is known, but in reinforcement learning the dynamics $f(x_t)$ are unknown, so $z_t$ cannot be integrated directly. How to approximate $z_t$?
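When the model derivatives are available, $z_t$ can be integrated alongside $x_t$ with Euler steps. A model-based Python sketch, assuming hypothetical callables `f`, `grad_x_f`, `grad_theta_f`, and `grad_x_r` (exactly the quantities reinforcement learning lacks, which motivates the rest of the talk):

```python
import numpy as np

def pathwise_gradient(f, grad_x_f, grad_theta_f, grad_x_r, x0, theta, T, dt):
    """Euler integration of x_t and z_t = dx_t/dtheta; returns grad V(theta)."""
    x = np.array(x0, dtype=float)
    z = np.zeros((len(x), len(theta)))   # z_0 = 0: x_0 does not depend on theta
    for n in range(int(T / dt)):
        t = n * dt
        # dz/dt = grad_theta f + grad_x f @ z, evaluated at the current state
        dz = grad_theta_f(t, x, theta) + grad_x_f(t, x, theta) @ z
        x = x + dt * f(t, x, theta)
        z = z + dt * dz
    return grad_x_r(x) @ z               # grad V = grad_x r(x_T) z_T
```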
Discretized Stochastic Processes Approximation
• A general convergence result (Theorem 3, informally): if $\frac{dx_t}{dt} = f(x_t)$, then a sequence of stochastic discrete processes whose average jumps match $f$ up to a vanishing error, and whose jump variance vanishes, converges to the deterministic trajectory $(x_t)$.
• Discretization of the state:
  Use a stochastic policy $\pi_\theta(u \mid t, x)$ over the finite action set $U$
  Stochastic discrete state process $(X_n)_{0 \le n \le N}$
  Initialization: $X_0 = x_0$
  Jump in state: $X_{n+1} = X_n + \Delta t\, f(X_n, u_n)$, with $u_n \sim \pi_\theta(\cdot \mid t_n, X_n)$
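A Python sketch of this discrete state process; `policy_probs` (returning the distribution over `actions`) is a hypothetical helper, `rng` is a numpy `Generator`, and in true model-free use the transition would come from the environment rather than from `f`:

```python
import numpy as np

def simulate_state_process(f, policy_probs, actions, x0, theta, T, dt, rng):
    """Simulate (X_n): Euler steps with actions sampled from the policy."""
    X = [np.array(x0, dtype=float)]
    for n in range(int(T / dt)):
        t = n * dt
        p = policy_probs(theta, t, X[-1])             # distribution over U
        u = actions[rng.choice(len(actions), p=p)]    # u_n ~ pi_theta(.|t_n, X_n)
        X.append(X[-1] + dt * np.asarray(f(X[-1], u)))  # jump dt * f(X_n, u_n)
    return X
```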
Proof of Proposition 5:
From Taylor's formula,
  $f(x_t, u) = f(\bar{x}_t, u) + O(\|x_t - \bar{x}_t\|)$
The average jump:
  $E[f(x_t, u)] = \sum_{u \in U} \pi_\theta(u \mid t, x_t)\, f(x_t, u) = \sum_{u \in U} \pi_\theta(u \mid t, x_t)\, f(\bar{x}_t, u) + O(\|x_t - \bar{x}_t\|) = \bar{f}(\bar{x}_t) + O(\|x_t - \bar{x}_t\|)$
where $\bar{f}(x) = \sum_{u \in U} \pi_\theta(u \mid t, x)\, f(x, u)$ is the policy-averaged drift.
Directly applying Theorem 3, Proposition 5 is proved.
• Discretization of the state gradient:
  Stochastic discrete state gradient process $(Z_n)_{0 \le n \le N}$
  Initialization: $Z_0 = 0$ (since the initial state $x_0$ does not depend on $\theta$)
  With the jump
  $Z_{n+1} = Z_n + \Delta t \left[ \nabla_x f(X_n, u_n)\, Z_n + f(X_n, u_n)\, \nabla_\theta \log \pi_\theta(u_n \mid t_n, X_n) \right]$
Proof of Proposition 6:
Since the average jump of $(Z_n)$ matches the dynamics of $z_t$ up to a vanishing error, and the variance of the jumps vanishes with $\Delta t$, we can directly apply Theorem 3: Proposition 6 is proved.
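One step of this gradient process as a Python sketch, under the reconstruction above; `grad_x_f_hat` (the estimate discussed on the next slide) and `grad_log_pi` (a vector over the policy parameters) are assumed inputs:

```python
import numpy as np

def gradient_process_step(Z, dX, dt, grad_x_f_hat, grad_log_pi):
    """One jump of Z_n; dX = X_{n+1} - X_n is the observed state jump."""
    # dX / dt is the observed drift f(X_n, u_n); the second term is the
    # likelihood-ratio contribution f(X_n, u_n) * grad_theta log pi.
    lr_term = np.outer(dX / dt, grad_log_pi)
    return Z + dt * (grad_x_f_hat @ Z + lr_term)
```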
Model-free Reinforcement Learning Algorithm
Let $\Delta X_n = X_{n+1} - X_n$ denote the observed jump in state.
In this stochastic approximation, $\Delta X_n$ is observed (hence so is $f(X_n, u_n) = \Delta X_n / \Delta t$), and $\nabla_\theta \log \pi_\theta(u_n \mid t_n, X_n)$ is given (the policy is known); we only need to approximate $\nabla_x f(X_n, u_n)$.

Least-Squares Approximation of $\nabla_x f$
Define
  $S(t) = \{ s \in [t - c,\, t] \mid u_s = u_t \}$,
the set of past discrete times $t - c \le s \le t$ at which the action $u_t$ was taken.
From Taylor's formula, for all discrete times $s \in S(t)$,
  $f(X_s, u_t) = f(X_t, u_t) + \nabla_x f(X_t, u_t)\,(X_s - X_t) + O(\|X_s - X_t\|^2)$
We deduce
  $\Delta X_s - \overline{\Delta X} \approx \Delta t\, \nabla_x f(X_t, u_t)\,(X_s - \bar{X})$,
where $\Delta X_s = X_{s+1} - X_s$.
We may derive an approximation of $\nabla_x f(X_t, u_t)$ by solving the least-squares problem:
  $\min_A \sum_{s \in S(t)} \big\| \Delta X_s - \overline{\Delta X} - A\,(X_s - \bar{X}) \big\|^2$
Then we have $\nabla_x f(X_t, u_t) \approx A^\ast / \Delta t$ for the minimizer $A^\ast$.
Here $\bar{X}$ and $\overline{\Delta X}$ denote the average values of $X_s$ and $\Delta X_s$ over $s \in S(t)$.
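A Python sketch of this least-squares step (notation as reconstructed above; the inputs are the past states and jumps collected over $S(t)$):

```python
import numpy as np

def estimate_grad_x_f(X_past, dX_past, dt):
    """X_past, dX_past: arrays of shape (k, d) over s in S(t)."""
    Xc = X_past - X_past.mean(axis=0)        # X_s - X_bar
    dXc = dX_past - dX_past.mean(axis=0)     # dX_s - dX_bar
    # Solve dXc ~= Xc @ A.T in the least-squares sense; A ~= dt * grad_x f
    A_T, *_ = np.linalg.lstsq(Xc, dXc, rcond=None)
    return A_T.T / dt
```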
Algorithm
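The algorithm's pseudocode did not survive extraction. As a substitute, here is a minimal end-to-end Python sketch assembling the pieces above (sampled actions, least-squares estimate of $\nabla_x f$ over a sliding window, likelihood-ratio term, terminal gradient $\nabla_x r(X_N)\, Z_N$). All helper names are hypothetical, and this reconstructs the structure rather than Munos's exact procedure:

```python
import numpy as np

def estimate_policy_gradient(f, policy_probs, grad_log_pi, grad_x_r,
                             actions, x0, theta, T, dt, window, rng):
    """Run one trajectory and return an estimate of grad_theta V(theta)."""
    d, p = len(x0), len(theta)
    X = np.array(x0, dtype=float)
    Z = np.zeros((d, p))
    history = []                                  # tuples (action index, X_s, dX_s)
    for step in range(int(T / dt)):
        t = step * dt
        a = rng.choice(len(actions), p=policy_probs(theta, t, X))
        X_next = X + dt * f(X, actions[a])        # in model-free use, from the environment
        dX = X_next - X
        # S(t): recent steps in the window where the same action was taken
        S = [(Xs, dXs) for (ai, Xs, dXs) in history[-window:] if ai == a]
        if len(S) >= d:                           # enough data for least squares
            Xp = np.array([s[0] for s in S])
            dXp = np.array([s[1] for s in S])
            Xc, dXc = Xp - Xp.mean(axis=0), dXp - dXp.mean(axis=0)
            A_T, *_ = np.linalg.lstsq(Xc, dXc, rcond=None)
            A = A_T.T / dt                        # estimate of grad_x f
        else:
            A = np.zeros((d, d))
        Z = Z + dt * (A @ Z) + np.outer(dX, grad_log_pi(theta, t, X, a))
        history.append((a, X.copy(), dX))
        X = X_next
    return grad_x_r(X) @ Z                        # grad V ~= grad_x r(X_N) Z_N
```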
Experimental Results
Six continuous state variables:
  x0, y0: hand position
  x, y: mass position
  vx, vy: mass velocity
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}
Goal: reach a target (xG, yG) with the mass at a specified time T
Terminal reward function: measures how close the mass is to the target at time T
The system dynamics: the chosen action drives the hand, and the mass, attached to the hand by a spring, moves under the resulting spring force
Consider a Boltzmann-like stochastic policy
  $\pi_\theta(u \mid t, x) = \frac{\exp(Q_\theta(t, x, u))}{\sum_{u' \in U} \exp(Q_\theta(t, x, u'))}$
where $Q_\theta(t, x, u)$ is a parameterized score for action $u$.
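A Python sketch of such a policy, assuming a linear score $\theta_u \cdot \phi(t, x)$ per action (the paper's exact parameterization may differ), together with the log-gradient the estimator needs:

```python
import numpy as np

def boltzmann_probs(theta, phi):
    """theta: (num_actions, num_features); phi: feature vector of (t, x)."""
    scores = theta @ phi
    e = np.exp(scores - scores.max())     # numerically stabilized softmax
    return e / e.sum()

def grad_log_boltzmann(theta, phi, a):
    """grad_theta log pi(a | t, x) = (indicator(a) - probs) outer phi."""
    probs = boltzmann_probs(theta, phi)
    ind = np.zeros(len(probs))
    ind[a] = 1.0
    return np.outer(ind - probs, phi)
```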
Conclusion
• Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters
• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process