
Page 1: Algorithms and Applications Soft Actor-Critic

CS391R: Robot Learning (Fall 2021)

Soft Actor-Critic Algorithms and Applications


Presenter: Zizhao Wang

10/12/2021

Page 2: Algorithms and Applications Soft Actor-Critic

Motivation: Sample Efficiency

Reinforcement Learning (RL) can solve complex decision making and control problems, but it is notoriously sample-inefficient.

Vinyals et al., 2019

“During training, each agent experienced up to 200 years of real-time StarCraft play.”

Grandmaster level in StarCraft II using multi-agent reinforcement learning

Page 3: Algorithms and Applications Soft Actor-Critic

Motivation: Sample Efficiency

Reinforcement Learning (RL) can solve complex decision making and control problems, but it is notoriously sample-inefficient.

Levine et al., 2016

● 14 robot arms learning to grasp in parallel

● Observing over 800,000 grasp attempts (3000 robot-hours of practice), we can see the beginnings of intelligent reactive behaviors.

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection

Page 4: Algorithms and Applications Soft Actor-Critic

Motivation: Robustness to Hyperparameters

RL performance is brittle to hyperparameters, which require laborious tuning.

Henderson et al., 2017

Reward scaling has a significant impact on DDPG performance. (DDPG will be introduced soon.)

Deep Reinforcement Learning that Matters

Page 5: Algorithms and Applications Soft Actor-Critic

Problem Setting: Markov Decision Process (MDP)

State space S

Action space A

Transition probability p(s_{t+1} | s_t, a_t)

Reward r(s_t, a_t)

Policy π(a_t | s_t)

Trajectory τ = (s_0, a_0, s_1, a_1, …)

Page 6: Algorithms and Applications Soft Actor-Critic

Context: RL Ranking on Sample Efficiency (better sample efficiency toward the top of the list)

● model-based: learns a model of the environment and uses the learned model to imagine interactions

Page 7: Algorithms and Applications Soft Actor-Critic

Context: RL Ranking on Sample Efficiency (better sample efficiency toward the top of the list)

● model-based

● model-free

  ○ off-policy: can learn from data generated by a different policy

Page 8: Algorithms and Applications Soft Actor-Critic

Context: RL Ranking on Sample Efficiency (better sample efficiency toward the top of the list)

● model-based

● model-free

  ○ off-policy: can learn from data generated by a different policy

  ○ on-policy: must learn from data generated by the current policy
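A minimal replay-buffer sketch (Python; the class and method names are illustrative, not from the slides) of why off-policy methods can keep reusing data collected by older policies:

```python
# Off-policy methods store every transition and replay it later, so data
# gathered by older policies still contributes to learning.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Transitions from any behavior policy can be stored...
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # ...and replayed later to update the current policy and critic.
        return random.sample(self.storage, batch_size)
```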


Page 10: Algorithms and Applications Soft Actor-Critic

Context: Actor-Critic

● Critic: estimates future return of state-action pair

● Actor: adjusts policy to maximize critic’s estimated future return

Page 11: Algorithms and Applications Soft Actor-Critic

Context: Actor-Critic

● Critic:

● Actor:

Deep Deterministic Policy Gradient (DDPG)

● Critic

● Actor

Page 12: Algorithms and Applications Soft Actor-Critic

Context: Actor-Critic

● Critic:

● Actor:

Deep Deterministic Policy Gradient (DDPG)

● Critic

● Actor

● Deterministic policy makes training unstable and brittle to hyperparameters.
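The DDPG critic and actor updates referenced above, as a minimal PyTorch-style sketch (the network, optimizer, and buffer names are illustrative assumptions, not the authors' code):

```python
# Minimal DDPG-style update: the critic regresses toward a one-step Bellman
# target computed with the deterministic target actor; the actor follows the
# gradient that maximizes the critic's estimate.
import torch
import torch.nn.functional as F

def ddpg_update(batch, q_net, q_target, actor, actor_target,
                q_opt, actor_opt, gamma=0.99):
    s, a, r, s2, done = batch  # tensors sampled from a replay buffer

    # Critic update
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_target(s2, actor_target(s2))
    q_loss = F.mse_loss(q_net(s, a), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor update: adjust the deterministic policy to maximize Q(s, mu(s))
    actor_loss = -q_net(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```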

Page 13: Algorithms and Applications Soft Actor-Critic

Context

Standard RL

reward: r(s_t, a_t)

Maximum Entropy RL

reward: r(s_t, a_t) + α · H(π(·|s_t))

Page 14: Algorithms and Applications Soft Actor-Critic

Context

Standard RL

reward: r(s_t, a_t)

Entropy H(π(·|s_t)) = E_{a∼π}[ −log π(a|s_t) ]: a measure of uncertainty

Maximum Entropy RL

reward: r(s_t, a_t) + α · H(π(·|s_t))

Page 15: Algorithms and Applications Soft Actor-Critic

Context

Maximum Entropy RL

reward: r(s_t, a_t) + α · H(π(·|s_t))

optimal policy: π* = argmax_π Σ_t E[ r(s_t, a_t) + α · H(π(·|s_t)) ]

Standard RL

reward: r(s_t, a_t)

optimal policy: π* = argmax_π Σ_t E[ r(s_t, a_t) ]

Page 16: Algorithms and Applications Soft Actor-Critic

Context

Maximum Entropy RL

reward: r(s_t, a_t) + α · H(π(·|s_t))

optimal policy: π* = argmax_π Σ_t E[ r(s_t, a_t) + α · H(π(·|s_t)) ]

Standard RL

reward: r(s_t, a_t)

optimal policy: π* = argmax_π Σ_t E[ r(s_t, a_t) ]

Why is maximum entropy helpful?
● encourages exploration
● enables multi-modal action selection
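One way to see the multi-modality point: the optimal maximum entropy policy is a Boltzmann distribution over the (soft) Q-values, so it keeps probability mass on every high-value action instead of collapsing onto a single one (notation as in the soft Q-learning / SAC line of work):

```latex
\pi^{*}_{\mathrm{MaxEnt}}(a_t \mid s_t) \;\propto\; \exp\!\big( Q^{*}_{\mathrm{soft}}(s_t, a_t) \,/\, \alpha \big)
```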


Page 18: Algorithms and Applications Soft Actor-Critic

Method

Soft Actor-Critic

reward:

Q-value:

Actor-Critic (DDPG)

reward:

Q-value:

Page 19: Algorithms and Applications Soft Actor-Critic

Method

Soft Actor-Critic

reward:

Q-value:

Q-value updates:

Actor-Critic (DDPG)

reward:

Q-value:

Q-value updates:

Page 20: Algorithms and Applications Soft Actor-Critic

Method

Soft Actor-Critic

reward:

Q-value:

Q-value updates:

policy improvement:

Actor-Critic (DDPG)

reward:

Q-value:

Q-value updates:

policy improvement:
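Written out (standard SAC and DDPG formulations; α is the entropy temperature, π_φ and μ_φ the stochastic and deterministic policies, Z a normalizing constant), the contrasted quantities are roughly:

```latex
\begin{aligned}
&\textbf{Soft Actor-Critic (stochastic policy } \pi_\phi \textbf{):}\\
&\quad \text{reward: } r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\\
&\quad \text{Q-value update: } Q(s_t,a_t) \leftarrow r(s_t,a_t)
  + \gamma\,\mathbb{E}_{a_{t+1}\sim\pi_\phi}\!\big[Q(s_{t+1},a_{t+1}) - \alpha\log\pi_\phi(a_{t+1}\mid s_{t+1})\big]\\
&\quad \text{policy improvement: } \min_\phi\; D_{\mathrm{KL}}\!\Big(\pi_\phi(\cdot\mid s_t)\,\Big\|\,\exp\!\big(Q(s_t,\cdot)/\alpha\big)/Z(s_t)\Big)\\[4pt]
&\textbf{Actor-Critic / DDPG (deterministic policy } \mu_\phi\textbf{):}\\
&\quad \text{reward: } r(s_t,a_t)\\
&\quad \text{Q-value update: } Q(s_t,a_t) \leftarrow r(s_t,a_t) + \gamma\,Q\big(s_{t+1},\mu_\phi(s_{t+1})\big)\\
&\quad \text{policy improvement: } \max_\phi\; Q\big(s_t,\mu_\phi(s_t)\big)
\end{aligned}
```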

Page 21: Algorithms and Applications Soft Actor-Critic

Method: Soft Actor-Critic vs. Actor-Critic (DDPG)

Page 22: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

Page 23: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

2. If we update the policy by projecting exp(Q/α) back onto the policy class (a KL-divergence minimization), then Q^{π_new}(s, a) ≥ Q^{π_old}(s, a) for all (s, a).

Page 24: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

2. If we update the policy by projecting exp(Q/α) back onto the policy class (a KL-divergence minimization), then Q^{π_new}(s, a) ≥ Q^{π_old}(s, a) for all (s, a).

3. If we repeat steps 1 and 2, we will find the optimal policy π*, such that Q^{π*}(s, a) ≥ Q^{π}(s, a) for any policy π and any (s, a).

Page 25: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence (Are they realistic?)

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

2. If we update the policy by projecting exp(Q/α) back onto the policy class (a KL-divergence minimization), then Q^{π_new}(s, a) ≥ Q^{π_old}(s, a) for all (s, a).

3. If we repeat steps 1 and 2, we will find the optimal policy π*, such that Q^{π*}(s, a) ≥ Q^{π}(s, a) for any policy π and any (s, a).

Page 26: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence (Are they realistic?)

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

2. If we update the policy by projecting exp(Q/α) back onto the policy class (a KL-divergence minimization), then Q^{π_new}(s, a) ≥ Q^{π_old}(s, a) for all (s, a).

3. If we repeat steps 1 and 2, we will find the optimal policy π*, such that Q^{π*}(s, a) ≥ Q^{π}(s, a) for any policy π and any (s, a). For how many steps? Possibly a lot.

Page 27: Algorithms and Applications Soft Actor-Critic

Method: Proof of Convergence (Are they realistic?)

Assumption: finite state and action space

1. If we repeatedly update the Q-value with the soft Bellman backup, it converges to the soft Q-function Q^π as the number of updates goes to infinity.

2. If we update the policy by projecting exp(Q/α) back onto the policy class (a KL-divergence minimization), then Q^{π_new}(s, a) ≥ Q^{π_old}(s, a) for all (s, a).

3. If we repeat steps 1 and 2, we will find the optimal policy π*, such that Q^{π*}(s, a) ≥ Q^{π}(s, a) for any policy π and any (s, a). For how many steps? Possibly a lot.

Not applicable to many robotics problems (continuous state/action spaces), and not computationally tractable.

Page 28: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

1. critic training

● convergence → gradient descent

Page 29: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

1. critic training

● convergence → gradient descent

● remove expectation

○ next action

Page 30: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

1. critic training

● convergence → gradient descent

● remove expectation

○ next action

○ entropy

Page 31: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

1. critic training

● convergence → gradient descent

● remove expectation

○ next action

○ entropy

○ next state
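A minimal sketch of the resulting single-sample critic update (PyTorch-style; q_net, q_target, and policy are illustrative names, not the authors' code):

```python
# Single-sample soft critic update: regress Q(s, a) toward
#   r + gamma * ( Q_target(s', a') - alpha * log pi(a'|s') ),   a' ~ pi(.|s'),
# i.e. the expectations over next state, next action, and entropy are each
# replaced by one sampled transition / the sampled action's log-probability.
import torch
import torch.nn.functional as F

def soft_critic_loss(batch, q_net, q_target, policy, alpha, gamma=0.99):
    s, a, r, s2, done = batch  # transitions sampled from the replay buffer
    with torch.no_grad():
        a2, logp2 = policy.sample(s2)           # a' ~ pi(.|s'), with log-prob
        soft_v2 = q_target(s2, a2) - alpha * logp2
        target = r + gamma * (1 - done) * soft_v2
    return F.mse_loss(q_net(s, a), target)
```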

Page 32: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

2. actor training

● convergence → gradient descent

● rewrite KL divergence


Page 34: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

2. actor training

● convergence → gradient descent

● rewrite KL divergence

● scale the loss by α and omit the constant (log-partition) term

Page 35: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

2. actor training

● convergence → gradient descent

● rewrite KL divergence

● scale the loss by α and omit the constant (log-partition) term

● remove expectation over action

Page 36: Algorithms and Applications Soft Actor-Critic

Method: Implementation (Practical Approximation)

2. actor training: gradient descent with the approximated loss above

design choice: policy as a normal distribution

● but a normal distribution is unimodal, so it loses the multi-modal advantage claimed earlier.
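A minimal sketch of the reparameterized actor update with a tanh-squashed Gaussian (the squashing, helper, and network names are illustrative choices, not the authors' code; q_net is assumed to return a batch of scalar values):

```python
# Reparameterized actor loss: minimize  E[ alpha * log pi(a|s) - Q(s, a) ]
# with a = tanh(mu + sigma * eps), eps ~ N(0, I), so the critic's gradient
# flows through the sampled action into the policy parameters.
import math
import torch

def sample_squashed_gaussian(mu, log_std):
    std = log_std.exp()
    eps = torch.randn_like(mu)
    pre_tanh = mu + std * eps                    # reparameterized Gaussian sample
    action = torch.tanh(pre_tanh)                # squash to the bounded action range
    # Gaussian log-density plus the tanh change-of-variables correction.
    logp = (-0.5 * ((pre_tanh - mu) / std) ** 2
            - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
    logp = logp - torch.log(1 - action.pow(2) + 1e-6).sum(-1)
    return action, logp

def actor_loss(states, policy, q_net, alpha):
    mu, log_std = policy(states)                 # policy outputs Gaussian parameters
    action, logp = sample_squashed_gaussian(mu, log_std)
    return (alpha * logp - q_net(states, action)).mean()
```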

Page 37: Algorithms and Applications Soft Actor-Critic

Method: Automatic Entropy Adjustment

Choosing the optimal temperature α is not trivial

● Recall that for maximum entropy RL, the reward is r(s_t, a_t) + α · H(π(·|s_t)).

Page 38: Algorithms and Applications Soft Actor-Critic

Method: Automatic Entropy Adjustment

Choosing the optimal temperature α is not trivial

● Recall that for maximum entropy RL, the reward is r(s_t, a_t) + α · H(π(·|s_t)).

● So the optimal α depends on the magnitude of the reward, which can vary a lot across

  ○ different tasks

  ○ different policies, as the policy gets improved

Page 39: Algorithms and Applications Soft Actor-Critic

Method: Automatic Entropy Adjustment

Choosing the optimal temperature α is not trivial

● Recall that for maximum entropy RL, the reward is r(s_t, a_t) + α · H(π(·|s_t)).

● So the optimal α depends on the magnitude of the reward, which can vary a lot across

  ○ different tasks

  ○ different policies, as the policy gets improved

● Ideal policy entropy should be

○ stochastic enough for exploration in uncertain regions

○ deterministic enough for performance in learned regions

Page 40: Algorithms and Applications Soft Actor-Critic

Method: Automatic Entropy Adjustment

Choosing the optimal temperature α is not trivial

● Recall that for maximum entropy RL, the reward is r(s_t, a_t) + α · H(π(·|s_t)).

● So the optimal α depends on the magnitude of the reward, which can vary a lot across

  ○ different tasks

  ○ different policies, as the policy gets improved

● Ideal policy entropy should be

○ stochastic enough for exploration in uncertain regions

○ deterministic enough for performance in learned regions

● Only constrain the average entropy across states

Page 41: Algorithms and Applications Soft Actor-Critic

Method: Automatic Entropy Adjustment

Only constrain the average entropy across states

● H̄: target entropy (the temperature α is always > 0)

● increase α if the average policy entropy falls below H̄, decrease it otherwise.
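A minimal sketch of the resulting temperature update as a dual gradient step on log α (the optimizer settings and the target_entropy value are illustrative):

```python
# Temperature (alpha) adjustment: treat alpha as a Lagrange multiplier for the
# average-entropy constraint. If the policy's entropy (-log pi) is below the
# target, the loss gradient pushes alpha up; if above, it pushes alpha down.
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_step(logp_actions, target_entropy):
    # logp_actions: log pi(a|s) for actions freshly sampled from the current policy
    alpha_loss = -(log_alpha * (logp_actions + target_entropy).detach()).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()
    return log_alpha.exp().item()                # current alpha value
```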

Page 42: Algorithms and Applications Soft Actor-Critic

Method: Overall Algorithm
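A rough sketch of the overall training loop (the environment API, buffer, and update helpers are illustrative placeholders, not the paper's code):

```python
# High-level SAC training loop: interact with the stochastic policy, store the
# transition, then take off-policy gradient steps on the critic, the actor, and
# the temperature from replayed minibatches. update_critic / update_actor /
# update_temperature / soft_update stand in for the losses sketched earlier;
# a classic gym-style env API is assumed.
def train_sac(env, policy, q_net, q_target, buffer,
              total_steps=100_000, batch_size=256, updates_per_step=1):
    state = env.reset()
    for step in range(total_steps):
        action = policy.act(state)                       # a ~ pi(.|s)
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        for _ in range(updates_per_step):
            batch = buffer.sample(batch_size)
            update_critic(batch, q_net, q_target, policy)   # soft Bellman backup
            update_actor(batch, policy, q_net)              # reparameterized policy step
            update_temperature(batch, policy)               # entropy-constraint dual step
            soft_update(q_target, q_net, tau=0.005)         # Polyak-averaged target
```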

Page 43: Algorithms and Applications Soft Actor-Critic

Experiment: Simulated Benchmarks

State: joint values

Action: joint torques

Metric: average return

Environments: Ant, Half Cheetah, Walker, Hopper, Humanoid

Page 44: Algorithms and Applications Soft Actor-Critic

Experiment: Baselines

● SAC with learned temperature α

● SAC with fixed temperature α

● DDPG (off-policy, deterministic policy)

● TD3 (DDPG with engineering improvements)

● PPO (on-policy)


Page 45: Algorithms and Applications Soft Actor-Critic

Experiment: Baselines

● SAC with learned temperature α

● SAC with fixed temperature α

● DDPG (off-policy, deterministic policy)

● TD3 (DDPG with engineering improvements)

● PPO (on-policy)

Hypotheses: compared to the baselines, SAC has better

● sample efficiency: learning speed + final performance

● stability: performance on hard tasks where hyperparameter tuning is challenging


Page 46: Algorithms and Applications Soft Actor-Critic


Experiment

Page 47: Algorithms and Applications Soft Actor-Critic

Experiment: Simulated Benchmarks

● Easy tasks (Hopper, Walker)

  ○ all algorithms perform comparably, except for DDPG

(Learning curves compare SAC with learned temperature, SAC with fixed temperature, DDPG, TD3, and PPO.)

Page 48: Algorithms and Applications Soft Actor-Critic

Experiment: Simulated Benchmarks

● Normal tasks (Half Cheetah, Ant)

  ○ SAC > TD3 > DDPG & PPO in both learning speed and final performance

Page 49: Algorithms and Applications Soft Actor-Critic

Experiment: Simulated Benchmarks

● Hard tasks (Humanoid)

  ○ DDPG and its variant TD3 fail to learn

  ○ SAC learns much faster than PPO

Page 50: Algorithms and Applications Soft Actor-Critic

Experiment: Simulated Benchmarks

● Larger variance with the automatic temperature adjustment

Page 51: Algorithms and Applications Soft Actor-Critic

Experiment: Real World Quadrupedal Locomotion

State: low-dimensional

Challenges: sample efficiency & generalization to unseen environments

Training: 160k steps (2 hours)


Page 52: Algorithms and Applications Soft Actor-Critic

Experiment: Real World Quadrupedal Locomotion

Train on Flat

Test on Slope

Test with Obstacles

Test with Stairs


Page 53: Algorithms and Applications Soft Actor-Critic

Experiment: Real World Manipulation

State: hand joint angles + image / ground truth valve position

Challenges: perceiving the valve position from images

Comparison (using ground truth valve position): SAC (3 hrs) vs PPO (7.4 hrs)

Page 54: Algorithms and Applications Soft Actor-Critic

Limitations

● Multi-modal policy

○ Though it is claimed that maximum entropy RL can benefit from a multi-modal policy, SAC chooses to use a unimodal policy (a normal distribution).

○ None of the experiments require multi-modal behaviors to solve.

● Hyperparameter tuning

○ Target entropy brings a new hyperparameter to tune.

○ Constraining only the average entropy across states does not guarantee the desired per-state exploration/exploitation balance.

○ For manipulation tasks that require accurate control, even averaged entropy regularization still hurts performance.

Page 55: Algorithms and Applications Soft Actor-Critic

Future Work and Extended Readings

● Learn a multi-modal policy rather than a unimodal policy

○ Reinforcement Learning with Deep Energy-Based Policies

● Improve sample-efficiency with auxiliary tasks

○ Improving Sample Efficiency in Model-Free Reinforcement Learning from Images

○ CURL: Contrastive Unsupervised Representations for Reinforcement Learning

● Improve generalizability by learning task-relevant features

○ Learning Invariant Representations for Reinforcement Learning without Reconstruction

○ Learning Task Informed Abstractions

Page 56: Algorithms and Applications Soft Actor-Critic

Summary

● Problem: sample-efficient RL with automatic hyperparameter tuning

● Limitations of prior work:

○ On-policy RL is sample-inefficient

○ DDPG, which uses a deterministic policy, is brittle to hyperparameters

● Key insights of the proposed work

○ Maximum entropy RL encourages exploration and is robust to environments and hyperparameters.

○ Use the average entropy across states as regularization.

● State-of-the-art sample efficiency and generalization on simulated and real-world tasks.