Actor-Critic Algorithms · rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/... · 2019. 7. 27.
TRANSCRIPT
Actor-Critic Algorithms
CS 294-112: Deep Reinforcement Learning
Sergey Levine
Class Notes
1. Remember to start forming final project groups
Today’s Lecture
1. Improving the policy gradient with a critic
2. The policy evaluation problem
3. Discount factors
4. The actor-critic algorithm
• Goals:
• Understand how policy evaluation fits into policy gradients
• Understand how actor-critic algorithms work
Recap: policy gradients
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
“reward to go”
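The "reward to go" from the recap can be sketched in plain Python (a minimal sketch; the function name and the optional discounting argument are illustrative, not from the slides):

```python
def reward_to_go(rewards, gamma=1.0):
    """Reward to go: Q_hat_t = sum over t' >= t of gamma**(t' - t) * r_{t'}.

    Computed with a single backward pass over one sampled trajectory.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

Each entry sums only the rewards from step t onward, which is what lets the estimator ignore rewards that the action at step t could not have influenced.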
Improving the policy gradient
“reward to go”
What about the baseline?
State & state-action value functions
the better this estimate, the lower the variance
unbiased, but high-variance single-sample estimate
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
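In the notation used throughout the lecture, the quantities on this slide are:

```latex
Q^{\pi}(\mathbf{s}_t,\mathbf{a}_t)=\sum_{t'=t}^{T} E_{\pi_\theta}\!\left[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\,\middle|\,\mathbf{s}_t,\mathbf{a}_t\right]
\qquad
V^{\pi}(\mathbf{s}_t)=E_{\mathbf{a}_t\sim\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)}\!\left[Q^{\pi}(\mathbf{s}_t,\mathbf{a}_t)\right]
\qquad
A^{\pi}(\mathbf{s}_t,\mathbf{a}_t)=Q^{\pi}(\mathbf{s}_t,\mathbf{a}_t)-V^{\pi}(\mathbf{s}_t)
```

The policy gradient weights each log-probability gradient by an estimate of the advantage A^pi; the better that estimate, the lower the variance.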
Value function fitting
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
Policy evaluation
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
Monte Carlo evaluation with function approximation
the same function should fit multiple samples!
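Monte Carlo evaluation with function approximation is just supervised regression of a value function onto sampled returns. A minimal sketch with hypothetical 1-D data and a linear model (the lecture uses a neural network; the numbers and names here are illustrative):

```python
import numpy as np

# Hypothetical rollout data: 1-D states and their single-sample Monte Carlo
# returns y_i = reward to go from that state (illustrative numbers).
states = np.array([0.0, 1.0, 2.0, 3.0])
mc_returns = np.array([6.0, 5.0, 3.0, 0.0])

# "Fit a model to estimate return": least-squares regression of V_phi(s)
# onto the targets, with V_phi linear in the features [1, s].
X = np.stack([np.ones_like(states), states], axis=1)
phi, *_ = np.linalg.lstsq(X, mc_returns, rcond=None)

def v_hat(s):
    """Fitted value estimate V_phi(s)."""
    return phi[0] + phi[1] * s
```

Because one function V_phi must fit all the samples, states visited in several trajectories receive an averaged, lower-variance estimate; that is the point of the slide's remark.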
Can we do better?
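The lecture's answer is to bootstrap: replace the single-sample Monte Carlo target with y_t ≈ r_t + γ V̂(s_{t+1}), trading a little bias for much lower variance. A minimal sketch (function name illustrative):

```python
def bootstrapped_targets(rewards, next_state_values, gamma=0.99):
    """Training targets y_t = r_t + gamma * V_hat(s_{t+1}).

    Lower variance than Monte Carlo returns, but biased whenever the
    current value estimate V_hat is wrong (it always is, a little).
    """
    return [r + gamma * v for r, v in zip(rewards, next_state_values)]
```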
Policy evaluation examples
TD-Gammon, Gerald Tesauro 1992
AlphaGo, Silver et al. 2016
An actor-critic algorithm
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
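The batch actor-critic loop follows the pictured cycle; its advantage step can be sketched as below (a sketch in the slide's notation, not the course implementation):

```python
# Batch actor-critic loop:
#   1. sample trajectories by running pi_theta
#   2. fit V_hat to the sampled rewards
#   3. evaluate A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s)
#   4. estimate grad J ~ sum of grad log pi(a|s) * A_hat(s, a)
#   5. take a gradient step on theta
def advantage_estimates(rewards, values, next_values, gamma=0.99):
    """Step 3: one-step advantage estimates from the fitted critic."""
    return [r + gamma * vn - v
            for r, v, vn in zip(rewards, values, next_values)]
```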
Aside: discount factors
episodic tasks
continuous/cyclical tasks
Aside: discount factors for policy gradients
Which version is the right one?
Further reading: Philip Thomas, Bias in natural actor-critic algorithms. ICML 2014
Actor-critic algorithms (with discount)
Break
Architecture design
two-network design
+ simple & stable
- no shared features between actor & critic
shared network design
Online actor-critic in practice
works best with a batch (e.g., parallel workers)
synchronized parallel actor-critic
asynchronous parallel actor-critic
Critics as state-dependent baselines
+ no bias
- higher variance (because single-sample estimate)
+ lower variance (due to critic)
- not unbiased (if the critic is not perfect)
+ no bias
+ lower variance (baseline is closer to rewards)
You’ll implement this for HW2!
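The last option on this slide — Monte Carlo reward to go minus a learned state-dependent baseline V̂(s_t) — stays unbiased (the baseline depends only on the state, not the action) while shrinking variance. A minimal sketch:

```python
def baselined_reward_to_go(rewards, values, gamma=1.0):
    """Per-step estimator (reward to go) - V_hat(s_t).

    Unbiased because the baseline depends only on the state; lower
    variance because V_hat tracks the expected return from each state.
    """
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return [q - v for q, v in zip(rtg, values)]
```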
Control variates: action-dependent baselines
+ no bias
- higher variance (because single-sample estimate)
+ goes to zero in expectation if critic is correct!
- not correct
use a critic without introducing bias (still unbiased), provided the second term can be evaluated
Gu et al. 2016 (Q-Prop)
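Schematically, the Q-Prop-style control variate subtracts an action-dependent baseline Q_φ and adds its expectation back analytically (notation approximate, following the slide):

```latex
\nabla_\theta J(\theta)\approx
E\!\left[\nabla_\theta\log\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)\left(\hat{Q}_t-Q_\phi(\mathbf{s}_t,\mathbf{a}_t)\right)\right]
+\nabla_\theta\, E_{\mathbf{a}_t\sim\pi_\theta}\!\left[Q_\phi(\mathbf{s}_t,\mathbf{a}_t)\right]
```

The first term goes to zero in expectation if the critic is correct; the estimator stays unbiased only if the second expectation can actually be evaluated (e.g., in closed form for simple policy classes).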
Eligibility traces & n-step returns
+ lower variance
- higher bias if value is wrong (it always is)
+ no bias
- higher variance (because single-sample estimate)
Can we combine these two, to control bias/variance tradeoff?
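The combination the slide asks about is the n-step return: use Monte Carlo rewards for the first n steps, then bootstrap with the critic. A sketch, with values[t] standing in for V̂(s_t):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """y_t = sum_{t'=t}^{t+n-1} gamma**(t'-t) * r_{t'} + gamma**n * V_hat(s_{t+n}).

    Small n: more bias, less variance (leans on the critic);
    large n: less bias, more variance (leans on Monte Carlo samples).
    """
    T = len(rewards)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:  # bootstrap only if the episode has not ended
        ret += gamma ** n * values[t + n]
    return ret
```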
Generalized advantage estimation
Schulman, Moritz, Levine, Jordan, Abbeel ‘16
Do we have to choose just one n?
Cut everywhere all at once!
Weighted combination of n-step returns
How to weight? Mostly prefer cutting earlier (less variance)
exponential falloff
similar effect as discount!
remember this? discount = variance reduction!
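The weighted combination of n-step returns on this slide reduces to a single backward recursion over the one-step TD errors δ_t = r_t + γV̂(s_{t+1}) − V̂(s_t), with the exponential falloff controlled by λ. A minimal sketch:

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (Schulman et al. '16):
    A_t = sum_{k>=0} (gamma*lam)**k * delta_{t+k},  where
    delta_t = r_t + gamma*V_hat(s_{t+1}) - V_hat(s_t).

    lam=0 gives the one-step actor-critic advantage; lam=1 gives
    Monte Carlo minus baseline.  gamma*lam acts like an extra discount.
    """
    adv = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```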
Review
• Actor-critic algorithms:
  • Actor: the policy
  • Critic: value function
  • Reduce variance of policy gradient
• Policy evaluation
  • Fitting value function to policy
• Discount factors
  • Carpe diem Mr. Robot
  • …but also a variance reduction trick
• Actor-critic algorithm design
  • One network (with two heads) or two networks
  • Batch-mode, or online (+ parallel)
• State-dependent baselines
  • Another way to use the critic
  • Can combine: n-step returns or GAE
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
Actor-critic examples
• High dimensional continuous control with generalized advantage estimation (Schulman, Moritz, L., Jordan, Abbeel ‘16)
• Batch-mode actor-critic
• Blends Monte Carlo and function approximator estimators (GAE)
Actor-critic examples
• Asynchronous methods for deep reinforcement learning (Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16)
• Online actor-critic, parallelized batch
• N-step returns with N = 4
• Single network for actor and critic
Actor-critic suggested readings
• Classic papers
  • Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
• Deep reinforcement learning actor-critic papers
  • Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
  • Schulman, Moritz, L., Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
  • Gu, Lillicrap, Ghahramani, Turner, L. (2017). Q-Prop: sample-efficient policy gradient with an off-policy critic: policy gradient with Q-function control variate