Trajectory-Based Off-Policy Deep Reinforcement Learning
Deep Deterministic Off-Policy Gradients (DD-OPG)


Page 1: Trajectory-Based Off-Policy Deep Reinforcement Learning

Andreas Doerr [1,2,3], Michael Volpp [1], Marc Toussaint [3], Sebastian Trimpe [2], Christian Daniel [1]

[1] Bosch Center for Artificial Intelligence, Renningen, Germany

[2] Max Planck Institute for Intelligent Systems, Stuttgart/Tübingen, Germany

[3] Machine Learning & Robotics Lab, University of Stuttgart, Germany

Trajectory-Based Off-Policy Deep Reinforcement Learning

Page 2: Fast & Efficient Model-Free Reinforcement Learning

How far can we push “model-free” RL?

$\nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H} \nabla_\theta \log \pi(a_t^i \mid s_t^i; \theta)\, R(\tau_i)$ [1]

[1] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
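For concreteness, a minimal numpy sketch of this Monte Carlo gradient estimate follows; it assumes trajectories are stored as lists of (state, action) pairs and that a callable grad_log_pi returning the score ∇_θ log π(a | s; θ) is available (both names are illustrative, not from the slides).

```python
import numpy as np

def reinforce_gradient(trajectories, returns, grad_log_pi):
    """Monte Carlo policy gradient estimate (REINFORCE, Williams 1992).

    trajectories : list of N trajectories, each a list of (s_t, a_t) pairs
    returns      : array of N trajectory returns R(tau_i)
    grad_log_pi  : callable (s, a) -> gradient of log pi(a | s; theta) w.r.t. theta
    """
    grad = None
    for traj, R in zip(trajectories, returns):
        for s, a in traj:
            g = grad_log_pi(s, a) * R            # score function weighted by the global return
            grad = g if grad is None else grad + g
    return grad / len(trajectories)              # average over the N sampled trajectories
```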

Page 3: Problems with Policy Gradient Methods

Problems:
- Data inefficiency: on-policy samples required, no sample reuse
- Gradient variance: stochastic policy, stochastic environment
- Exploration vs. exploitation: step size control, policy (relative) entropy

How far can we push "model-free" RL?

$\nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H} \nabla_\theta \log \pi(a_t^i \mid s_t^i; \theta)\, R(\tau_i)$ [1]

[1] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Page 4: Deep Deterministic Off-Policy Gradients (DD-OPG)

Core concepts in DD-OPG

Page 5: Deep Deterministic Off-Policy Gradients (DD-OPG)

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions

Core concepts in DD-OPG

Page 6: Deep Deterministic Off-Policy Gradients (DD-OPG)

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{p(\tau_i \mid \theta)}{\frac{1}{N} \sum_{j=1}^{N} p(\tau_i \mid \theta_j)}$

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions
- Implementation: importance sampling with empirical mixture distribution [1]

[1] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.

Core concepts in DD-OPG
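A hedged numpy sketch of this global return estimator follows; it assumes the trajectory likelihoods p(τ_i | θ_j) under all previously used policies have already been collected into a matrix, and the function and argument names are illustrative only.

```python
import numpy as np

def mixture_is_estimate(likelihoods, likelihoods_new, returns):
    """Off-policy return estimate with an empirical mixture behavioral distribution.

    likelihoods     : (N, N) array, likelihoods[i, j] = p(tau_i | theta_j) for past policies
    likelihoods_new : (N,)  array, p(tau_i | theta) under the policy being evaluated
    returns         : (N,)  array of trajectory returns R(tau_i)
    """
    mixture = likelihoods.mean(axis=1)       # (1/N) sum_j p(tau_i | theta_j)
    weights = likelihoods_new / mixture      # importance weights w_i(theta)
    return np.mean(weights * returns)        # J(theta) = (1/N) sum_i w_i(theta) R(tau_i)
```

In practice the trajectory likelihoods shrink exponentially with the horizon, so such a computation is usually carried out in log space before forming the ratios.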

Page 7: Deep Deterministic Off-Policy Gradients (DD-OPG)

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions
- Implementation: importance sampling with empirical mixture distribution [2]

Deterministic Policy
- Reduced rollout stochasticity
- Richer behaviors with parameter-space exploration [1]

[1] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ICLR 2018.
[2] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.

Core concepts in DD-OPG

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{p(\tau_i \mid \theta)}{\frac{1}{N} \sum_{j=1}^{N} p(\tau_i \mid \theta_j)}$

Page 8: Deep Deterministic Off-Policy Gradients (DD-OPG)

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{\prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_\theta(s_t), \Sigma)}{\frac{1}{N} \sum_{j=1}^{N} \prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_{\theta_j}(s_t), \Sigma)}$

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions
- Implementation: importance sampling with empirical mixture distribution [2]

Deterministic Policy
- Reduced rollout stochasticity
- Richer behaviors with parameter-space exploration [1]
- Implementation: model parameter Σ, length scale in action space

[1] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ICLR 2018.
[2] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.

Core concepts in DD-OPG
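The deterministic policy enters the estimator only through the trajectory likelihoods; a minimal sketch of one such log-likelihood under the assumption of an isotropic Gaussian of variance Σ = σ²I around the deterministic policy output is given below (function and variable names are illustrative).

```python
import numpy as np

def log_traj_likelihood(states, actions, mu, sigma2):
    """log prod_t N(a_t | mu(s_t), sigma2 * I) for one recorded rollout.

    states, actions : (H, d_s) and (H, d_a) arrays from one trajectory
    mu              : deterministic policy, callable (H, d_s) -> (H, d_a)
    sigma2          : scalar variance, the length scale Sigma in action space
    """
    diff = actions - mu(states)                           # deviation of recorded actions from the policy mean
    d_a = actions.shape[1]
    log_norm = -0.5 * d_a * np.log(2.0 * np.pi * sigma2)  # Gaussian normalizer per time step
    return np.sum(log_norm - 0.5 * np.sum(diff**2, axis=1) / sigma2)
```

Evaluating this for the current parameters θ and for every θ_j in memory yields the weights w_i(θ) above; subtracting the maximum log-likelihood before exponentiating keeps the ratios numerically stable over long horizons.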

Page 9: Deep Deterministic Off-Policy Gradients (DD-OPG)

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{\prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_\theta(s_t), \Sigma)}{\frac{1}{N} \sum_{j=1}^{N} \prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_{\theta_j}(s_t), \Sigma)}$

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions
- Implementation: importance sampling with empirical mixture distribution [2]

Deterministic Policy
- Reduced rollout stochasticity
- Richer behaviors with parameter-space exploration [1]
- Implementation: model parameter Σ, length scale in action space

Distributional Policy Search
- Policy search leveraging a lower bound

[1] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ICLR 2018.
[2] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.

Core concepts in DD-OPG

Page 10: Deep Deterministic Off-Policy Gradients (DD-OPG)

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i(\theta)\, R(\tau_i), \qquad w_i(\theta) = \frac{\prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_\theta(s_t), \Sigma)}{\frac{1}{N} \sum_{j=1}^{N} \prod_{t=0}^{H} \mathcal{N}(a_t \mid \mu_{\theta_j}(s_t), \Sigma)}$

Global Return Distribution Estimator
- Incorporation of all data (off-policy)
- Backtracking to good solutions
- Implementation: importance sampling with empirical mixture distribution [2]

Deterministic Policy
- Reduced rollout stochasticity
- Richer behaviors with parameter-space exploration [1]
- Implementation: model parameter Σ, length scale in action space

Distributional Policy Search
- Policy search leveraging a lower bound
- Implementation: estimation of empirical sample size and variance

[1] Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. ICLR 2018.
[2] Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. NeurIPS 2010.

Core concepts in DD-OPG
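The slides only name the ingredients of the lower bound (empirical sample size and variance), so the numpy sketch below shows one plausible variance-penalized surrogate in the spirit of the bounds cited on the following slides (Schulman et al. 2015; Metelli et al. 2018); the exact form and the delta parameter are assumptions for illustration, not the bound used in the paper.

```python
import numpy as np

def lower_bound_estimate(weights, returns, delta=0.1):
    """Variance-penalized off-policy return estimate (illustrative form only).

    weights : (N,) importance weights w_i(theta)
    returns : (N,) trajectory returns R(tau_i)
    delta   : confidence parameter controlling the size of the penalty
    """
    n_eff = weights.sum()**2 / np.sum(weights**2)   # effective sample size of the weights
    estimate = np.mean(weights * returns)           # importance-sampling estimate of J(theta)
    variance = np.var(weights * returns, ddof=1)    # empirical variance of the weighted returns
    penalty = np.sqrt(variance * (1.0 - delta) / (delta * n_eff))
    return estimate - penalty                       # pessimistic surrogate used for policy search
```

The penalty grows when the importance weights concentrate on a few trajectories (small effective sample size), which discourages the policy search from drifting far away from the data in memory.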

Page 11: Return Distribution Estimator

[Figure: Return distribution estimator for log Σ = −1 · I. Normalized return over the parameter space; shown are the ground truth (MC), the importance sampling estimate, the lower bound [1,2], and the data.]

[1] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz,P. Trust region policy optimization. ICML 2015.[2] Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. NeurIPS 2018.

Page 12: Return Distribution Estimator

[Figure: Return distribution estimator for three length scales, log Σ = 0 · I, −1 · I, and −2 · I. Each panel shows the normalized return over the parameter space: ground truth (MC), estimator, lower bound [1,2], and data. With a large length scale the estimate tends toward the global average; with a small length scale it approaches the Monte Carlo estimate.]

[1] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz,P. Trust region policy optimization. ICML 2015.[2] Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. NeurIPS 2018.

Page 13: Algorithmic Choices

Comparison: DD-OPG vs. REINFORCE, TRPO, PPO

Memory selection
- DD-OPG: all available trajectories, prioritized trajectory replay
- REINFORCE / TRPO / PPO: only on-policy samples from the current batch

Exploration
- DD-OPG: parameter space
- REINFORCE / TRPO / PPO: action space

Objective 𝓛(θ)
- DD-OPG: expected-return lower bound
- REINFORCE: expected return
- TRPO: expected return with KL constraint
- PPO: expected return (lower bound)

Optimization
- DD-OPG: fully optimized with backtracking
- REINFORCE: one gradient step
- TRPO / PPO: locally optimized
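Read column-wise, the DD-OPG choices suggest an outer loop roughly like the following sketch (illustrative helper names and a simple Gaussian parameter perturbation; this is an assumption about how the pieces fit together, not the authors' reference implementation).

```python
import numpy as np

def dd_opg_loop(rollout, surrogate, optimize, n_iterations, theta0, noise=0.1):
    """Sketch of an off-policy policy-search loop in the style of DD-OPG.

    rollout   : callable theta -> (trajectory, return) from a deterministic rollout
    surrogate : callable (theta, memory) -> lower-bound estimate of J(theta)
    optimize  : callable (objective, theta_init) -> theta maximizing the objective
    """
    memory = []                                        # all trajectories ever collected
    theta, best_theta, best_value = theta0, theta0, -np.inf
    for _ in range(n_iterations):
        theta_explore = theta + noise * np.random.randn(*theta.shape)   # parameter-space exploration
        traj, ret = rollout(theta_explore)             # low-noise deterministic rollout
        memory.append((theta_explore, traj, ret))      # keep every trajectory for off-policy reuse
        theta = optimize(lambda th: surrogate(th, memory), theta)       # fully optimize the surrogate
        value = surrogate(theta, memory)
        if value >= best_value:                        # backtracking to good solutions
            best_theta, best_value = theta, value
        else:
            theta = best_theta
    return best_theta
```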

Page 14: Experimental Results – From REINFORCE to DD-OPG

[Figure: Normalized return over interaction steps (×10³) for ablations between REINFORCE and full DD-OPG: REINFORCE (no memory, one gradient step), memory + one gradient step, memory + fully optimized, and full DD-OPG; trajectory replay buffer sizes 5, 20, and 50.]

Page 15: Experimental Results – Benchmark Results

[Figure: Benchmark learning curves on Cartpole, MountainCar, and Swimmer: return over system interactions (×10⁵) for DD-OPG, REINFORCE, TRPO, and PPO.]

Setup: GARAGE continuous control environments, Gaussian MLP policy (16, 16).
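As a rough illustration of the policy class used across all methods, a Gaussian MLP policy with two hidden layers of 16 tanh units can be sketched in PyTorch as follows; the observation and action dimensions in the usage example are placeholders, and this is not the GARAGE implementation used in the experiments.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Gaussian policy with a (16, 16) tanh MLP mean and a state-independent log-std."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 16), nn.Tanh(),
            nn.Linear(16, 16), nn.Tanh(),
            nn.Linear(16, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example: sample an action for a 4-dimensional observation and 1-dimensional action space.
policy = GaussianMLPPolicy(obs_dim=4, act_dim=1)
action = policy(torch.zeros(4)).sample()
```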

Page 16: Conclusion

- Novel off-policy policy gradient method
- Enables data-efficient sample reuse
- Incorporates low-noise deterministic rollouts
- Length scale in action space as the only model assumption
- Promising benchmark results

[Figure: DD-OPG (red) benchmark results.]