
Page 1:

Introduction to Deep Reinforcement Learning

Model-free Methods

Zhiao Huang

Page 2:

Deep Reinforcement Learning Era

• In 2013, DeepMind used deep reinforcement learning to play Atari games

Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).

Page 3:

Deep Reinforcement Learning Era

• In March 2016, AlphaGo beat the human champion Lee Sedol

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

Page 4:

Deep RL for Robotics

• People started to apply deep learning to robotics

Finn, Chelsea, and Sergey Levine. "Deep visual foresight for planning robot motion." ICRA (2017).

Haarnoja, Tuomas, et al. "Learning to walk via deep reinforcement learning." arXiv preprint arXiv:1812.11103 (2018).

Page 5:

So…

What is deep reinforcement learning?

Why do we care about it?

Page 6:

Overview

• Introduction
  • Markov Decision Process
  • Policy, value and the Bellman equation
• Q-learning
  • DQN/DDPG
• Policy Optimization
  • REINFORCE
  • Actor-Critic Methods

Page 7:

Reinforcement Learning

Machine Learning

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

Page 8:

Reinforcement learning is modeled as a Markov Decision Process.

Page 9:

Markov Decision Process

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Basic reinforcement learning is modeled as a Markov Decision Process.

Page 10:

Markov Decision Process

Page 11:

Markov Decision Process

Reinforcement learning is modeled as a Markov Decision Process $(S, A, P_a, R_a)$:

• A set of environment and agent states, $S$;

• A set of actions, $A$, of the agent;

• $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$ (Markov property);

• $R_a(s, s') = R(s, a, s')$ is the immediate reward received after transitioning from state $s$ to state $s'$ under action $a$.

Note that the transition probabilities are the same at every time step, i.e. the MDP is stationary.

Page 12:

An Example as a Graph

In the discrete and finite case, MDP ≈ graph

Page 13:

An Example in Python

OpenAI Gym Interface
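The slide's code is not preserved in this transcript, so here is a minimal sketch of the Gym interaction loop (the environment name is illustrative; this uses the classic pre-0.26 Gym API):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                       # initial state s_0
done = False
while not done:
    action = env.action_space.sample()  # a random action a_t
    # step() returns the next state s_{t+1}, the immediate reward
    # R(s_t, a_t, s_{t+1}), a terminal flag, and an info dict.
    obs, reward, done, info = env.step(action)
env.close()
```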

Page 14:

An Example in Python

Page 15:

Finite-horizon vs. Infinite-horizon

• An MDP itself doesn't define a time limit …

Page 16:

Finite-horizon vs. Infinite-horizon

• An MDP itself doesn't define a time limit … which is called an infinite-horizon process

• But we can turn a finite-horizon environment into an infinite-horizon one by adding an extra terminal state and including the current time step in the observation …

• In practice, we usually set a maximum time step:
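As a concrete illustration (an assumption on my part, not the slide's original snippet), Gym's TimeLimit wrapper is one standard way to impose a maximum time step:

```python
import gym
from gym.wrappers import TimeLimit

# End every episode after at most 200 steps, regardless of the
# underlying environment's own termination condition.
env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=200)
```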

Page 17:

Overview

• Introduction
  • Markov Decision Process
  • Policy, value and the Bellman equation
• Q-learning
  • DQN/DDPG
• Policy Optimization
  • REINFORCE
  • Actor-Critic Methods

Page 18:

The solution to the MDP is a policy that maximizes the expected total reward.

Page 19:

Policy

• A policy is a global mapping from states to actions:

$\pi(s): S \to A$

• A policy can be a neural network:
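For instance, a minimal sketch of a policy network in PyTorch (the slide's figure is not preserved; the layer sizes and dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        # Deterministic policy for a discrete action space:
        # return the highest-scoring action.
        return self.net(s).argmax(dim=-1)
```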

Page 20:

Policy

• A policy is a global mapping from states to actions:

$\pi(s): S \to A$

• Sometimes we consider a stochastic policy,

$\pi(a \mid s): S \times A \to \mathbb{R}$

  $P(a \mid s)$    $a_0$    $a_1$
  $s_0$            0.5      0.5
  $s_1$            0.0      1.0
  $s_2$            1.0      0.0
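Sampling an action from this tabular stochastic policy might look like the following sketch (NumPy; the table is the one above):

```python
import numpy as np

pi = np.array([[0.5, 0.5],   # P(a | s0)
               [0.0, 1.0],   # P(a | s1)
               [1.0, 0.0]])  # P(a | s2)

def sample_action(s):
    # Draw a_t ~ pi(a | s) for a state index s.
    return np.random.choice(pi.shape[1], p=pi[s])
```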

Page 21:

Policy in Python

• A policy is a global mapping from states to actions:

$\pi(s): S \to A$
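The slide's code is not preserved; a minimal sketch of a deterministic tabular policy in Python (states and actions are illustrative):

```python
# pi(s): S -> A as a plain lookup table.
policy = {"s0": "a1", "s1": "a0", "s2": "a0"}

def pi(s):
    return policy[s]
```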

Page 22:

Trajectory

• Every time we "sample" actions from the policy in the environment, we get a trajectory:

$\tau = \{s_0, a_0, s_1, a_1, \dots, s_{t-1}, a_{t-1}, \dots\}$

Page 23:

Trajectory

• Every time we "sample" actions from the policy in the environment, we get a trajectory:

$\tau = \{s_0, a_0, s_1, a_1, \dots, s_{t-1}, a_{t-1}, \dots\}$

For example, a trajectory:

$s_0, a_0, s_1, a_1, s_2, \dots$
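A sketch of how such a trajectory is collected with the Gym interface from earlier (assuming `policy` is any callable mapping states to actions, classic Gym API):

```python
def rollout(env, policy):
    # Collect one trajectory tau = {(s_0, a_0, r_0, s_1), (s_1, a_1, r_1, s_2), ...}.
    trajectory = []
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```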

Page 24:

Trajectory

• Both the policy and the MDP are random!

• If we know $\pi(a_i \mid s_i)$ and $p(s_{i+1} \mid s_i, a_i)$, we can compute the probability of a sampled trajectory:

$p(\tau) = p(s_0) \prod_{i=0}^{\infty} \pi(a_i \mid s_i)\, p(s_{i+1} \mid s_i, a_i)$

• We don't like products, so we usually consider the log-likelihood:

$\log p(\tau) = \log p(s_0) + \sum_{i=0}^{\infty} \left[\log \pi(a_i \mid s_i) + \log p(s_{i+1} \mid s_i, a_i)\right]$
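A sketch of that log-likelihood as code (assumed callables `log_p0`, `log_pi`, and `log_p` for $\log p(s_0)$, $\log \pi(a \mid s)$, and $\log p(s' \mid s, a)$; the trajectory format matches the rollout sketch above):

```python
def log_prob_trajectory(trajectory, log_p0, log_pi, log_p):
    # log p(tau) = log p(s_0) + sum_i [log pi(a_i|s_i) + log p(s_{i+1}|s_i, a_i)]
    logp = log_p0(trajectory[0][0])
    for (s, a, _, s_next) in trajectory:
        logp += log_pi(a, s) + log_p(s_next, s, a)
    return logp
```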

Page 25:

Trajectory

• A policy induces a distribution over trajectories:

$\tau \sim p_\pi(\tau)$ or $\tau \sim p_\pi(\tau \mid s_0)$

• This notation will be useful later

Page 26:

The solution to the MDP is a policy that maximizes the expected total reward.

Page 27:

Reward

• The goal of reinforcement learning is to find a policy that maximizes the total reward

Page 28:

Reward

• The goal of reinforcement learning is to find a policy that maximizes the total reward

• In the discrete case, the total reward is similar to the length of a path in the graph

Page 29:

Reward

• The total reward of a trajectory is the sum of the rewards at each time step:

$R(\tau) = \sum_{i=0}^{\infty} R(s_i, a_i, s_{i+1})$

• If we fix the policy and the starting state, we can compute the expected reward of the sampled trajectories:

$V^\pi(s_0) = E_{\tau \sim p_\pi(\tau \mid s_0)}[R(\tau)]$

Page 30:

Discount Factor

• The total reward of a trajectory is the sum of the rewards at each time step:

$R(\tau) = \sum_{i=0}^{\infty} R(s_i, a_i, s_{i+1})$

• It may not converge for infinite-horizon trajectories

Page 31:

Discount Factor

• Let $0 \le \gamma < 1$ and

$R_\gamma(\tau) = \sum_{i=0}^{\infty} \gamma^i R(s_i, a_i, s_{i+1}) \le \frac{R_{\max}}{1-\gamma}$

• We always add a discount factor to the rewards

• $\gamma$ is usually omitted from the notation
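A sketch of computing $R_\gamma(\tau)$ from a list of rewards (accumulating backwards avoids forming $\gamma^i$ explicitly):

```python
def discounted_return(rewards, gamma=0.99):
    # R = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```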

Page 32:

RL Problem

Formally, given an MDP, find a policy $\pi$ that maximizes the expected future reward:

$\max_\pi E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$

where

$p(\tau) = p(s_0) \prod_{i=0}^{\infty} \pi(a_i \mid s_i)\, p(s_{i+1} \mid s_i, a_i),$

$R_\gamma(\tau) = \sum_{i=0}^{\infty} \gamma^i R(s_i, a_i, s_{i+1})$

Page 33:

How can we find the optimal one?

Page 34:

Naïve Algorithm 1: Random Search

We can randomly sample policies $\pi$ and evaluate them with Monte Carlo sampling:

$\max_\pi E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$

• For $i$ from 1 to $\infty$:
  • Sample a policy $\pi$
  • Sample trajectories $\tau$ to estimate $R_\pi = E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$
  • Return $\pi$ if $R_\pi$ is large enough
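A sketch of this loop, reusing the earlier `rollout` and `discounted_return` sketches (`sample_policy`, `threshold`, and `n_eval` are illustrative assumptions):

```python
def random_search(env, sample_policy, threshold, n_eval=10):
    while True:
        pi = sample_policy()                   # sample a policy
        # Monte Carlo estimate of E_tau [R_gamma(tau)].
        returns = [discounted_return([step[2] for step in rollout(env, pi)])
                   for _ in range(n_eval)]
        if sum(returns) / n_eval >= threshold:  # good enough: return it
            return pi
```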

Page 35:

Naïve Algorithm 1: Random Search

We can randomly sample policies $\pi$ and evaluate them with Monte Carlo sampling:

$\max_\pi E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$

• For $i$ from 1 to $\infty$:
  • Sample a policy $\pi$
  • Sample trajectories $\tau$ to estimate $R_\pi = E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$
  • Return $\pi$ if $R_\pi$ is large enough

We can speed it up with value iteration.

Page 36:

Value Function

• Fix $\pi$; if we marginalize out the action $a$, then for a one-step transition:

$p(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$

$R(s, s') = \sum_a R(s, a, s')\, \pi(a \mid s)$

[Diagram: a two-state chain with $p(s_1 \mid s_0) = 1$, $R(s_0, s_1) = 1$, and a self-loop $p(s_1 \mid s_1) = 1$, $R(s_1, s_1) = 0$.]

$\sum_{s_0, s_1, \dots} \left[ \prod_i p(s_{i+1} \mid s_i) \right] \left[ \sum_i R(s_i, s_{i+1}) \right] = \sum_\tau p(\tau)\, R(\tau)$

Page 37:

Value Function

• The value function:

$V^\pi(s) = E_{\tau \sim p_\pi(\tau \mid s)}[R(\tau)]$

• The expected reward if the agent follows the policy $\pi$ and is currently at state $s$

• It's well-defined because of the Markov property

Page 38:

Bellman Equations (1)

• $V^\pi$ satisfies the Bellman equations by definition:

$V^\pi(s) = E_{a \sim \pi(s),\, s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^\pi(s')]$

$= \sum_{s'} \left[ R(s, s') + \gamma\, p(s, s')\, V^\pi(s') \right]$

[Diagram: the two-state chain again, with $p(s_1 \mid s_0) = 1$, $R(s_0, s_1) = 1$, $p(s_1 \mid s_1) = 1$, $R(s_1, s_1) = 0$.]

$V_0 = 1 + \gamma V_1, \quad V_1 = 0 + \gamma V_1$

Page 39:

Value Iteration

The Bellman equation is a linear operator:

$V^\pi(s) = \sum_{s'} \left[ R(s, s') + \gamma\, p(s, s')\, V^\pi(s') \right]$

$\Rightarrow V^\pi = R + \gamma P V^\pi$

Let $f(x) = R + \gamma P x$; then $V^\pi$ is the fixed point of $f$.

Initialize with $V_0$
For $t$ from 1 to $\infty$:
  in each step, let $V_{t+1} = f(V_t)$
Return $V$ when it converges.
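A sketch of this fixed-point iteration for a finite MDP (NumPy; `R` is the expected one-step reward vector and `P` the state-transition matrix under the fixed policy, as marginalized on the earlier slide):

```python
import numpy as np

def evaluate_policy(R, P, gamma=0.99, tol=1e-8):
    V = np.zeros(len(R))
    while True:
        V_next = R + gamma * P @ V   # one application of f
        if np.max(np.abs(V_next - V)) < tol:
            return V_next            # converged to the fixed point
        V = V_next
```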

Page 40:

Example

• $R = [1, 0]$, $\gamma = 0.99$, $P = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$; we can check that for any $V_0$, the iteration converges to $V = [1, 0]$.

• $f$ is a contraction: let $V'$ be the real value function; then

$\|f(V) - f(V')\|_\infty \le \gamma \|PV - PV'\|_\infty \le \gamma \|V - V'\|_\infty$

[Diagram: the two-state chain again, with $p(s_1 \mid s_0) = 1$, $R(s_0, s_1) = 1$, $p(s_1 \mid s_1) = 1$, $R(s_1, s_1) = 0$.]

Page 41:

Naïve Algorithm 2: Policy Evaluation

We can randomly sample policies $\pi$ and evaluate them with value iteration:

$\max_\pi E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$

• For $i$ from 1 to $\infty$:
  • Sample a policy $\pi$
  • Run value iteration to compute $R_\pi = E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$
  • Return $\pi$ if $R_\pi$ is large enough

Page 42:

Better Approaches

We can randomly sample policies $\pi$ and evaluate them with value iteration:

$\max_\pi E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$

• For $i$ from 1 to $\infty$:
  • Sample a policy $\pi$ → Optimize the policy $\pi$
  • Run value iteration to compute $R_\pi = E_{\tau \sim p_\pi(\tau)}[R_\gamma(\tau)]$
  • Return $\pi$ if $R_\pi$ is large enough

Page 43:

Q function

• For every state $s$ and action $a$, we can define the Q function:

$Q^\pi(s, a) = E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^\pi(s')]$

$\Rightarrow V^\pi(s) = E_{a \sim \pi(s)}[Q^\pi(s, a)]$

It is the expected reward from state $s$ if we select action $a$ and follow $\pi$ afterwards.

Page 44:

Bellman Equations (2)

• Consider the optimal policy $\pi^*$; we must have

$\pi^*(s) = \operatorname{argmax}_a Q^{\pi^*}(s, a)$

where $Q^{\pi^*}(s, a) = E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^{\pi^*}(s')]$

• Otherwise, $\pi'(s) = \operatorname{argmax}_a Q(s, a)$ produces a better policy:

$(R' + \gamma P' V)(s) > (R^* + \gamma P^* V)(s)$

$f'(V) = R' + \gamma P' V > R^* + \gamma P^* V = f^*(V)$

Page 45:

Bellman Equations (2)

Put them together:

$\pi^*(s) = \operatorname{argmax}_a Q(s, a)$

$Q^{\pi^*}(s, a) = E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^{\pi^*}(s')]$

$V^{\pi^*}(s) = E_{a \sim \pi^*(s)}[Q^{\pi^*}(s, a)]$

$\Rightarrow V^*(s) = \max_a E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^*(s')]$

Page 46:

Bellman Equations (2)

• The Bellman equations (2):

$V^*(s) = \max_a E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^*(s')]$

• As in the expectation case, we can prove that this is a contraction, so it has a unique fixed point.

Page 47:

Bellman Equations (2)

• The Bellman equations (2):

$V^*(s) = \max_a E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^*(s')]$

• We only need to find the policy that satisfies

$\pi(s) = \operatorname{argmax}_a Q^\pi(s, a),$

which is a fixed point of the Bellman equations

Page 48:

Algorithm 3: Policy Iteration

• We only need to find the policy that satisfies:

$\pi(s) = \operatorname{argmax}_a Q^\pi(s, a)$

• Policy iteration methods iteratively adjust the policy at each state to satisfy the above constraint

• For $i$ from 1 to $\infty$:
  • Compute $V^\pi$ and $Q^\pi$ with value iteration
  • $\forall s,\ \pi(s) = \operatorname{argmax}_a Q^\pi(s, a)$
  • Return $\pi$ if it converges
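A sketch of policy iteration for a finite MDP with a known model (an assumption for illustration: `P[a]` is the $|S| \times |S|$ transition matrix and `R[a]` the expected-reward vector for action `a`; the evaluation step solves the linear Bellman equation directly instead of iterating):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    n = next(iter(P.values())).shape[0]
    pi = {s: next(iter(P)) for s in range(n)}   # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V.
        P_pi = np.array([P[pi[s]][s] for s in range(n)])
        R_pi = np.array([R[pi[s]][s] for s in range(n)])
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
        # Policy improvement: pi(s) = argmax_a Q^pi(s, a).
        new_pi = {s: max(P, key=lambda a: R[a][s] + gamma * P[a][s] @ V)
                  for s in range(n)}
        if new_pi == pi:
            return pi, V
        pi = new_pi
```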

Page 49:

Policy Iteration

Page 50:

Value Iteration

• We can view the previous algorithm through the lens of the value function

• The optimal value function $V^* = V^{\pi^*}$ should satisfy the Bellman equation for all states:

$V^*(s) = \max_a E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^*(s')]$

• $V^*$ is the fixed point of the operator $B(V)$ defined by the right-hand side

• Find the fixed point iteratively

Page 51:

Algorithm 4: Value Iteration
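The slide's pseudocode is not preserved; a sketch of value iteration with the Bellman optimality backup, using the same per-action `P[a]`/`R[a]` model as in the policy-iteration sketch:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    n = next(iter(P.values())).shape[0]
    V = np.zeros(n)
    while True:
        # V(s) <- max_a E_{s'}[R(s, a, s') + gamma * V(s')]
        V_next = np.max([R[a] + gamma * P[a] @ V for a in P], axis=0)
        if np.max(np.abs(V_next - V)) < tol:
            break
        V = V_next
    # Extract the greedy policy pi(s) = argmax_a Q(s, a).
    actions = list(P)
    Q = np.stack([R[a] + gamma * P[a] @ V for a in actions])
    return V, [actions[i] for i in np.argmax(Q, axis=0)]
```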

Page 52:

Value Iteration and Policy Iteration

• Value iteration and policy iteration both try to solve the Bellman equation;

• Policy iteration methods maintain the value function of the current policy, and then adjust the policy based on

$\pi(s) = \operatorname{argmax}_a Q^\pi(s, a)$

• Value iteration methods solve for $V^*$ directly with

$V^*(s) = \max_a E_{s' \sim p(s' \mid s, a)}[R(s, a, s') + \gamma V^*(s')]$

Page 53:

What’s the trouble?

• $S$ is a high-dimensional space
  • We can't maintain $\pi$, $Q$ and $V$ directly

• The model $p(s' \mid s, a)$ and the reward $R(s, a, s')$ are unknown
  • We can only sample trajectories from the environment with the policy

• The input may be raw images

Page 54:

Next

• Introduction
  • Markov Decision Process
  • Policy, Value and the Bellman equation
• Deep Q-learning
  • DQN/DDPG
• Policy Optimization
  • REINFORCE
  • Actor-Critic Methods