Trajectory-Based Off-Policy Deep Reinforcement Learning · Deep Deterministic Off-Policy Gradients…
TRANSCRIPT
Andreas Doerr [1,2,3], Michael Volpp [1], Marc Toussaint [3], Sebastian Trimpe [2], Christian Daniel [1]
[1] Bosch Center for Artificial Intelligence, Renningen, Germany
[2] Max Planck Institute for Intelligent Systems, Stuttgart/Tübingen, Germany
[3] Machine Learning & Robotics Lab, University of Stuttgart, Germany
Trajectory-Based Off-Policy Deep Reinforcement Learning
Fast & Efficient Model-Free Reinforcement Learning
How far can we push “model-free” RL?
$$\nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H} \nabla_\theta \log \pi\!\left(a_t^i \mid s_t^i; \theta\right) R(\tau_i) \qquad \text{[1]}$$

[1] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
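To make the estimator concrete, here is a minimal numpy sketch of this REINFORCE gradient estimate. The Gaussian policy with linear mean and the trajectory layout are illustrative assumptions, not part of the slides.

```python
import numpy as np

def reinforce_gradient(theta, trajectories, sigma=0.5):
    """Monte Carlo estimate of the REINFORCE policy gradient [1].

    Illustrative setup (not from the slides): a Gaussian policy
    pi(a | s; theta) = N(a | theta^T s, sigma^2) with linear mean.
    Each trajectory is a tuple (states, actions, R) with return R = R(tau).
    """
    grad = np.zeros_like(theta)
    for states, actions, ret in trajectories:
        for s, a in zip(states, actions):
            # Score function of the Gaussian policy:
            # grad_theta log N(a | theta @ s, sigma^2) = (a - theta @ s) / sigma^2 * s
            grad += (a - theta @ s) / sigma**2 * s * ret
    return grad / len(trajectories)

# Usage sketch with two random trajectories of horizon H = 3:
rng = np.random.default_rng(0)
theta = np.zeros(2)
trajs = [(rng.normal(size=(3, 2)), rng.normal(size=3), 1.0) for _ in range(2)]
print(reinforce_gradient(theta, trajs))
```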
Problems with Policy Gradient Methods
Problems:
- Data inefficiency: on-policy samples required; no sample reuse
- Gradient variance: stochastic policy; stochastic environment
- Exploration vs. exploitation: step size control; policy (relative) entropy
Deep Deterministic Off-Policy Gradients (DD-OPG)

Core concepts in DD-OPG:

- Global return distribution estimator: incorporation of all data (off-policy); backtracking to good solutions. Implementation: importance sampling with the empirical mixture distribution [2],

  $$\hat{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{p(\tau_i \mid \theta)}{\frac{1}{N} \sum_{j=1}^{N} p(\tau_i \mid \theta_j)}$$

- Deterministic policy: reduced rollout stochasticity; richer behaviors with parameter space exploration [1]. Implementation: model parameter Σ as a length scale in action space, so the trajectory likelihoods become Gaussian,

  $$w_i(\theta) = \frac{\prod_{t=0}^{H} \mathcal{N}\!\left(a_t^i \mid \mu_\theta(s_t^i), \Sigma\right)}{\frac{1}{N} \sum_{j=1}^{N} \prod_{t=0}^{H} \mathcal{N}\!\left(a_t^i \mid \mu_{\theta_j}(s_t^i), \Sigma\right)}$$

- Distributional policy search: policy search leveraging a lower bound. Implementation: estimation of the empirical sample size and variance.

[1] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ICLR 2018.
[2] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.
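For illustration, a minimal sketch of computing this estimator, done in log space for numerical stability. The policy-mean function `mu_fn`, the `log_sigma` parameterization of Σ, and the `(states, actions, return)` data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_traj_likelihood(mu_fn, theta, states, actions, log_sigma=-1.0):
    """log prod_t N(a_t | mu_theta(s_t), Sigma) with Sigma = exp(log_sigma)^2 * I.

    mu_fn(theta, s) is the deterministic policy mean; the isotropic Sigma
    acts purely as a length scale in action space.
    """
    sigma2 = np.exp(log_sigma) ** 2
    ll = 0.0
    for s, a in zip(states, actions):
        diff = np.atleast_1d(a) - np.atleast_1d(mu_fn(theta, s))
        ll += -0.5 * (diff @ diff / sigma2 + diff.size * np.log(2 * np.pi * sigma2))
    return ll

def dd_opg_return_estimate(mu_fn, theta, data, behavior_thetas, log_sigma=-1.0):
    """J_hat(theta) = 1/N sum_i w_i(theta) R(tau_i), with the empirical
    mixture (1/N) sum_j p(tau_i | theta_j) in the denominator [2]."""
    J = 0.0
    for states, actions, ret in data:
        log_num = log_traj_likelihood(mu_fn, theta, states, actions, log_sigma)
        log_mix = np.array([log_traj_likelihood(mu_fn, th, states, actions, log_sigma)
                            for th in behavior_thetas])
        # log of the mixture denominator via log-sum-exp for stability
        m = log_mix.max()
        log_den = m + np.log(np.mean(np.exp(log_mix - m)))
        J += np.exp(log_num - log_den) * ret
    return J / len(data)
```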
Return Distribution Estimator
[Figure: Return distribution estimator over parameter space for three length scales, log Σ = 0 · I, −1 · I, and −2 · I. Each panel plots the normalized return against the parameter space and shows the ground truth (Monte Carlo) return, the estimator, the lower bound [1,2], and the data; annotations mark the global average, the importance sampling estimate, and the Monte Carlo estimate.]
[1] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. ICML 2015.
[2] Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. NeurIPS 2018.
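The slides leave the exact bound to the references, but a common form penalizes the importance sampling estimate by its empirical variance and effective sample size. A hedged sketch in that spirit (the penalty coefficient `lam` is an illustrative assumption):

```python
import numpy as np

def penalized_return_estimate(weights, returns, lam=1.0):
    """Confidence lower bound on the IS return estimate in the spirit of
    [1, 2]: subtract a variance penalty that grows as the effective
    sample size (ESS) of the importance weights shrinks."""
    w, r = np.asarray(weights, float), np.asarray(returns, float)
    j_hat = np.mean(w * r)                    # importance sampling estimate
    ess = w.sum() ** 2 / np.sum(w ** 2)       # empirical (effective) sample size
    var = np.var(w * r)                       # empirical variance of the summands
    return j_hat - lam * np.sqrt(var / ess)   # penalized lower-bound objective
```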
Algorithmic Choices
|                  | DD-OPG | REINFORCE | TRPO | PPO |
|------------------|--------|-----------|------|-----|
| Memory selection | All available trajectories (prioritized trajectory replay) | Only on-policy samples from the current batch | Only on-policy samples from the current batch | Only on-policy samples from the current batch |
| Exploration | Parameter space | Action space | Action space | Action space |
| Objective $\mathcal{L}(\theta)$ | Expected return lower bound | Expected return | Expected return with KL constraint | Expected return (lower bound) |
| Optimization | Fully optimized with backtracking | One gradient step | Locally optimized | Locally optimized |
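Putting the DD-OPG column together, here is a self-contained toy sketch of the full loop on a one-parameter problem: deterministic rollouts, all trajectories kept in memory, and the penalized objective fully optimized every iteration. The toy return function, the grid search (standing in for gradient ascent with backtracking), and all constants are illustrative assumptions.

```python
import numpy as np

def toy_return(theta):
    """Illustrative stand-in for a deterministic rollout: a one-step
    'trajectory' whose return peaks at theta = 2."""
    return -(theta - 2.0) ** 2

def lower_bound_objective(theta, thetas, rets, log_sigma=-1.0, lam=0.5):
    """Penalized off-policy return estimate on the toy problem. The
    'trajectory' generated under theta_i is just the point theta_i, so
    p(tau_i | theta) reduces to a Gaussian kernel (normalizers cancel)."""
    s2 = np.exp(log_sigma) ** 2
    num = np.exp(-0.5 * (thetas - theta) ** 2 / s2)
    den = np.mean(np.exp(-0.5 * (thetas[None, :] - thetas[:, None]) ** 2 / s2), axis=0)
    w = num / den
    j_hat = np.mean(w * rets)
    ess = w.sum() ** 2 / np.sum(w ** 2)
    return j_hat - lam * np.sqrt(np.var(w * rets) / ess)

# DD-OPG-style loop: deterministic rollouts, all data kept in memory,
# objective fully optimized per iteration (grid search as a stand-in
# for gradient ascent with backtracking to good solutions).
thetas, rets = np.array([0.0]), np.array([toy_return(0.0)])
grid = np.linspace(-1.0, 4.0, 201)
for _ in range(15):
    theta = grid[np.argmax([lower_bound_objective(g, thetas, rets) for g in grid])]
    thetas = np.append(thetas, theta)
    rets = np.append(rets, toy_return(theta))
print(f"best theta found: {thetas[np.argmax(rets)]:.2f}")  # close to 2.0
```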
Experimental Results – From REINFORCE to DD-OPG
[Figure: Normalized return (0.0 to 1.0) vs. interaction steps (×10³, 0 to 15). Ablation from REINFORCE to full DD-OPG: REINFORCE (no memory, one gradient step), memory with one gradient step, memory fully optimized, and full DD-OPG; trajectory replay buffer sizes 5, 20, and 50.]
Experimental Results – Benchmark Results
[Figure: Return vs. system interactions (×10⁵) on Cartpole, MountainCar, and Swimmer for DD-OPG, REINFORCE, TRPO, and PPO.]
Setup: GARAGE continuous control environments; Gaussian MLP policy (16, 16).
Conclusion
- Novel off-policy policy gradient method
- Enables data-efficient sample reuse
- Incorporates low-noise deterministic rollouts
- Length scale in action space as the only model assumption
- Promising benchmark results