RL Tutorial (CSC2515 Fall 2019)
Adamo Young
University of Toronto
November 17, 2019
Markov Decision Process (MDP)
$S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, $R_t \in \mathcal{R} \subseteq \mathbb{R}$; timesteps $t \in \mathbb{N}$ (discrete)
Finite: $\mathcal{S}$ and $\mathcal{A}$ both have a finite number of elements (discrete)
Fully observable: $S_t$ is known exactly
Trajectory: a sequence of states, actions, and rewards: $S_0, A_0, R_1, S_1, A_1, R_2, \ldots$
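To make the notation concrete, here is a minimal sketch of a finite MDP as a lookup table over $p(s', r \mid s, a)$, with a sampler for one environment step. The two-state example and all names are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch: a finite MDP as a transition table p(s', r | s, a).
# Entries are (next_state, reward, probability); probabilities sum to 1
# for each (state, action) pair.
import random

P = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 5.0, 0.4), ("s1", 0.0, 0.6)],
    ("s1", "a1"): [("s1", -1.0, 1.0)],
}

def step(state, action):
    """Sample (S_{t+1}, R_{t+1}) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [p for (_, _, p) in outcomes]
    s_next, reward, _ = random.choices(outcomes, weights=weights)[0]
    return s_next, reward
```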
Markov Property
Dynamics of the MDP are completely captured in the current state and the proposed action
States include all information about the past that makes a difference for the future
$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
MDP Generalizations
Continuous timesteps
Continuous states/actions
Partially observable (POMDP): probability distribution over $S_t$
Don't worry about them for now
Episodic vs Continuing Tasks
Episodic Task
$t = 0, 1, \ldots, T$
$T$ (length of episode) can vary from episode to episode
Example Pacman
Continuing Task
No upper limit on t
Example learning to walk
Expected Return and Unified Task Notation
$G_t$ is the return (the discounted sum of future rewards)
$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
$\gamma$ is the discount factor, $0 < \gamma \le 1$
Episodic Tasks:
- $R_t = 0$ for all $t > T$: finite sum
- Often choose $\gamma = 1$
- Always converges since the sum is finite
Continuing Tasks:
- Infinite sum
- $\gamma < 1$ (future rewards are valued less)
- If $R_t$ is bounded, the infinite sum converges
Either way, we can ensure that $G_t$ is well-defined
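As a quick worked example of the definition: the return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which gives a simple backward computation for an episodic reward sequence. A minimal sketch:

```python
# Minimal sketch: compute G_t for every t of an episode via the
# recursion G_t = R_{t+1} + gamma * G_{t+1} (with G_t = 0 past the end).
def discounted_returns(rewards, gamma=0.99):
    """rewards[t] is R_{t+1}; the result's entry t is G_t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. discounted_returns([1.0, 0.0, 2.0], gamma=0.5) == [1.5, 1.0, 2.0]
```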
Policies
Policy $\pi$: a mapping from states to actions
$\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$; can be deterministic or probabilistic
Example: a deterministic wall-following maze navigation policy (sketched in code below):
- Turn right if possible
- If you can't turn right, go straight if possible
- If you can't turn right or go straight, turn left if possible
- Otherwise, turn around
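A minimal sketch of that wall-follower as code; the heading encoding and the `can_move` helper are assumptions for illustration:

```python
# Minimal sketch: the deterministic wall-following policy above.
# Headings are 0=N, 1=E, 2=S, 3=W; can_move(state, d) is an assumed
# helper that reports whether direction d is open from the current cell.
def wall_following_policy(state, heading, can_move):
    right = (heading + 1) % 4
    left = (heading - 1) % 4
    back = (heading + 2) % 4
    for d in (right, heading, left, back):  # priority order from the slide
        if can_move(state, d):
            return d
    return heading  # unreachable in a well-formed maze
```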
Value Functions
State-value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
- Intuitively: how "good" a state $s$ is under policy $\pi$
- $v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
Action-value function: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
- Intuitively: how "good" an action $a$ is in state $s$ under policy $\pi$
- $q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
MDP Illustration
Note: in this case we are using $p(s' \mid s, a)$ to model the environment instead of $p(s', r \mid s, a)$
This is because there is no ambiguity in the reward $r$ given $s, a, s'$ (but in general there could be)
Even only two states and three possible actions can be pretty complicated to model
Some Potentially Useful Value Identities
Recall: $\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$
Recall: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
Recall: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
$v_\pi(s) = \mathbb{E}_\pi[q_\pi(s, a) \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a)$
$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
$q_\pi(s, a) = \mathbb{E}_{p(s', r \mid s, a)}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
Common theme: value functions are recursive; expand one step to see a useful pattern
Solving RL Problems
Two main classes of solutions
Tabular Solution Methods:
- Provide exact solutions
- Do not scale very well with the size of the search space
Approximate Solution Methods:
- Use function approximators (e.g., neural networks) for the value and/or policy functions
- Policy Gradient Methods fall under this category
Introducing Policy Gradient
Parameterize the policy: $\pi(a \mid s, \theta) \doteq \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$
- For example, $\pi$ could be a neural network with weights $\theta$
Measure policy performance with a scalar performance measure $J(\theta)$
Maximize $J(\theta)$ with gradient ascent on $\theta$
General update rule: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla_\theta J(\theta_t)}$
- $\alpha$ is the step size
- $\widehat{\nabla_\theta J(\theta_t)}$ is a stochastic approximation to $\nabla_\theta J(\theta_t)$
We are increasing the probability of trajectories with large rewards and decreasing the probability of trajectories with small rewards
For simplicity, we'll only consider episodic tasks for the remainder of the presentation
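As one concrete parameterization (an assumption for illustration, not the slides' choice), a minimal sketch of a linear-softmax policy over a discrete action set, together with the generic ascent step:

```python
import numpy as np

# Minimal sketch: linear-softmax policy pi(a | s, theta) for discrete
# actions; theta has shape (n_actions, n_features).
def softmax_policy(theta, s_features):
    logits = theta @ s_features
    z = np.exp(logits - logits.max())   # stabilized softmax
    return z / z.sum()

def ascent_step(theta, grad_estimate, alpha=0.01):
    """Generic gradient *ascent* update on J(theta)."""
    return theta + alpha * grad_estimate
```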
The Policy Gradient Theorem
$J(\theta) \doteq v_\pi(s_0)$, where $s_0$ is the initial state of a trajectory and $\pi$ is a policy parameterized by $\theta$
The Policy Gradient Theorem states that
$\nabla J(\theta) = \nabla_\theta v_\pi(s_0) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$
$\mu(s) \doteq \frac{\eta(s)}{\sum_{s'} \eta(s')}$, the on-policy distribution under $\pi$
- $\eta(s)$ is the expected number of visits to state $s$
- $\mu(s)$ is the fraction of time spent in state $s$
The constant of proportionality is the average length of an episode
See p. 325 of the Sutton & Barto RL book (Second Edition) for a concise proof
REINFORCE Update
\begin{align*}
\nabla J(\theta) &\propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta) \\
&= \mathbb{E}_\pi\left[\sum_a q_\pi(S_t, a)\, \nabla \pi(a \mid S_t, \theta)\right] \\
&= \mathbb{E}_\pi\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\, \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right] \\
&= \mathbb{E}_\pi\left[\mathbb{E}_\pi\left[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]\right] = \mathbb{E}_\pi\left[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]
\end{align*}
REINFORCE Update (Continued)
\begin{align*}
\nabla J(\theta) &\propto \mathbb{E}_\pi\left[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] = \mathbb{E}_\pi\left[\mathbb{E}_\pi[G_t \mid S_t, A_t]\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] \\
&= \mathbb{E}_\pi\left[\mathbb{E}_\pi\left[G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \,\middle|\, S_t, A_t\right]\right] = \mathbb{E}_\pi\left[G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]
\end{align*}
REINFORCE update: $\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}$
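Since $\nabla \pi / \pi = \nabla \log \pi$, the update is usually implemented through the log-probability. A minimal sketch of one update for the linear-softmax policy sketched earlier (all names are assumptions):

```python
import numpy as np

# Minimal sketch: one REINFORCE update. For a linear-softmax policy,
# grad log pi(a|s) = outer(onehot(a) - pi(.|s), features(s)).
def reinforce_step(theta, s_features, action, G, alpha=0.01):
    pi = softmax_policy(theta, s_features)   # sketch from an earlier slide
    onehot = np.zeros_like(pi)
    onehot[action] = 1.0
    grad_log_pi = np.outer(onehot - pi, s_features)
    return theta + alpha * G * grad_log_pi   # theta_{t+1}
```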
Quick Aside: Monte Carlo
A way of approximating an integral/expectation
$\mathbb{E}_{p(x)}[f(x)] = \int_{x \in \mathcal{X}} p(x)\, f(x)\, dx$
If $p(x)$ and/or $f(x)$ is complicated, the integral is difficult to solve analytically
Simple idea: draw $N$ samples $x^{(i)} \sim p(x)$ and take the average
$\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$
The larger $N$ is, the better the estimate
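A quick numerical check of the idea (assuming NumPy): estimate $\mathbb{E}[x^2]$ under a standard normal, whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # N samples x^(i) ~ p(x) = N(0, 1)
print(np.mean(x ** 2))             # ~1.0, the Monte Carlo estimate of E[x^2]
```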
REINFORCE Algorithm
Generate entire trajectories and compute $G_t$ from them
Gradient estimates only need to be proportional to the true gradient, since the constant of proportionality can be absorbed into the step size $\alpha$
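Putting the pieces together, a minimal sketch of the training loop. It reuses the earlier sketches, assumes a pre-0.26 Gym-style environment API (4-tuple `step`) and an assumed `featurize` helper, and omits the $\gamma^t$ factor that appears in Sutton & Barto's pseudocode; it is a sketch, not the exact algorithm box from the slide.

```python
import numpy as np

# Minimal sketch of the REINFORCE loop; softmax_policy, reinforce_step,
# and discounted_returns are the earlier sketches.
def run_reinforce(env, theta, featurize, episodes=1000, gamma=0.99, alpha=0.01):
    for _ in range(episodes):
        traj, s, done = [], env.reset(), False
        while not done:                        # generate one full trajectory
            phi = featurize(s)
            pi = softmax_policy(theta, phi)
            a = np.random.choice(len(pi), p=pi)
            s, r, done, _ = env.step(a)
            traj.append((phi, a, r))
        Gs = discounted_returns([r for (_, _, r) in traj], gamma)
        for (phi, a, _), G in zip(traj, Gs):   # one update per timestep
            theta = reinforce_step(theta, phi, a, G, alpha)
    return theta
```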
REINFORCE as a Score Function Gradient Estimator
We can rewrite $J(\theta)$ in terms of an expectation over trajectories $\tau$:
$J(\theta) \doteq v_\pi(s_0) = \mathbb{E}_\tau[G_0(\tau)] = \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau)$
- $p(\tau \mid \pi_\theta)$ is the probability of the trajectory under policy $\pi_\theta$
- $G_0(\tau)$ is the sum of the rewards given trajectory $\tau$
Recall: $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$ by the chain rule
We can use the log-derivative trick to derive an expression for $\nabla J(\theta)$
In statistics, this estimator is called the Score Function Estimator
REINFORCE as a Score Function Gradient Estimator (Continued)
\begin{align*}
\nabla J(\theta) &= \nabla \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau) = \sum_\tau \nabla p(\tau \mid \pi_\theta)\, G_0(\tau) \\
&= \sum_\tau \frac{p(\tau \mid \pi_\theta)}{p(\tau \mid \pi_\theta)}\, \nabla p(\tau \mid \pi_\theta)\, G_0(\tau) = \sum_\tau p(\tau \mid \pi_\theta)\, \nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau) \\
&= \mathbb{E}_\tau\left[\nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau)\right]
\end{align*}
$\widehat{\nabla J(\theta)} = \frac{1}{N} \sum_{i=1}^{N} \nabla \log p(\tau^{(i)} \mid \pi_\theta)\, G_0(\tau^{(i)})$ (Monte Carlo approximation)
Properties of the REINFORCE/Score Function Gradient Estimator
Pros:
- Unbiased: $\mathbb{E}\big[\widehat{\nabla J(\theta)}\big] = \nabla J(\theta)$
- Very general (the reward does not need to be continuous/differentiable)
Cons:
- High variance (noisy)
- Requires many sample trajectories to estimate the gradient
Reducing REINFORCE Estimator Variance
Recall: $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$
Let $b(s)$ be a baseline (some function of $s$ that does not depend on $\theta$):
$\sum_a b(s)\, \nabla \pi(a \mid s, \theta) = b(s)\, \nabla \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla 1 = 0$
Thus $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big)\, \nabla \pi(a \mid s, \theta)$
New update rule: $\theta_{t+1} = \theta_t + \alpha\, \big(G_t - b(S_t)\big)\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$
Natural choice for $b(s)$: an estimate of the state value, $\hat{v}(S_t, w)$, parameterized by weights $w$
- We want to adjust $\theta$ based on how much better (or worse) the path is than the average score of paths starting in the current state ($\hat{v}$)
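A minimal sketch of the baseline-adjusted update, using a linear value estimate $\hat{v}(s, w) = w^\top \phi(s)$ fit by Monte Carlo; all names are assumptions layered on the earlier sketches:

```python
import numpy as np

# Minimal sketch: REINFORCE-with-baseline update for one (s, a, G) step.
def reinforce_baseline_step(theta, w, phi, action, G,
                            alpha_theta=0.01, alpha_w=0.1):
    delta = G - w @ phi                    # G_t - v_hat(S_t, w)
    w = w + alpha_w * delta * phi          # Monte Carlo baseline update
    pi = softmax_policy(theta, phi)        # sketch from an earlier slide
    onehot = np.zeros_like(pi)
    onehot[action] = 1.0
    theta = theta + alpha_theta * delta * np.outer(onehot - pi, phi)
    return theta, w
```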
REINFORCE with Baseline Algorithm
Two different step sizes, $\alpha_\theta$ and $\alpha_w$
$\hat{v}$ is also estimated with Monte Carlo
REINFORCE with Baseline vs without
Extensions: Actor-Critic
Use bootstrapping to update the state-value estimate $\hat{v}(s, w)$ for a state $s$ from the estimated values of subsequent states
Introduces bias, but reduces variance
Can use one-step or n-step returns (as opposed to the full returns in REINFORCE); a one-step sketch follows
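A minimal sketch of the one-step variant, where the critic's TD error $\delta = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)$ replaces the full return; names follow the earlier sketches and are assumptions:

```python
import numpy as np

# Minimal sketch: one-step actor-critic update for a single transition.
def actor_critic_step(theta, w, phi, action, reward, phi_next, done,
                      gamma=0.99, alpha_theta=0.01, alpha_w=0.1):
    v_next = 0.0 if done else w @ phi_next
    delta = reward + gamma * v_next - w @ phi   # TD error (bootstrapped)
    w = w + alpha_w * delta * phi               # critic update
    pi = softmax_policy(theta, phi)             # actor, from earlier sketch
    onehot = np.zeros_like(pi)
    onehot[action] = 1.0
    theta = theta + alpha_theta * delta * np.outer(onehot - pi, phi)
    return theta, w
```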
One-step Actor-Critic Algorithm
OpenAI Gym
Free Python library of environments for reinforcement learning
Easy to use, good for benchmarking
Environments include:
- Classic control problems
- Atari games
- 3D simulations (MuJoCo)
- Robotics
Check out the docs and environments
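Typical usage looks like the following sketch (classic pre-0.26 Gym API, matching the TF1-era demo on the next slide; the environment name is just an example):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                    # initial observation S_0
done, total = False, 0.0
while not done:
    action = env.action_space.sample()          # random policy, for demo
    obs, reward, done, info = env.step(action)  # one environment step
    total += reward
print("episode return:", total)
```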
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoy's CSC411 course webpage
Uses policy gradient with REINFORCE
Note: requires TensorFlow 1.x (not 2.x)
References
Sutton and Barto, "Reinforcement Learning: An Introduction" (Second Edition), 2018
Xuchan (Jenny) Bao, "Intro to RL and Policy Gradient", CSC421 Tutorial, 2019 (link)
Michael Guerzhoy, "CSC411 Project 4: Reinforcement Learning with Policy Gradients", 2017 (link)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
Value Functions
State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π
I vπ(s) = Eπ
[ infinsumk=0
γkRt+k+1
∣∣∣∣∣St = s
]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]
I Intuitively how rdquogoodrdquo an action a is in state s under policy π
I qπ(s a) = Eπ
[suminfink=0 γ
kRt+k+1
∣∣∣∣∣St = sAt = a
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28
MDP Illustration
Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)
This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)
Even only two states and three possible actions can be prettycomplicated to model
Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28
Some Potentially Useful Value Identities
Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]
Recall qπ(s a) Eπ [Gt |St = sAt = a]
vπ(s) = Eπ [qπ(s a)|St = s] =sum
aisinA(s)
π(a|s)qπ(s a)
qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]
qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]
Common theme value functions are recursive expand one step to seea useful pattern
Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28
Solving RL Problems
Two main classes of solutions
Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space
Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor
policy functionsI Policy Gradient Methods fall under this category
Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28
Introducing Policy Gradient
Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ
Measure policy performance with scalar performance measure J(θ)
Maximize J(θ) with gradient ascent on θ
General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)
We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards
For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation
Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
MDP Illustration
Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)
This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)
Even only two states and three possible actions can be prettycomplicated to model
Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28
Some Potentially Useful Value Identities
Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]
Recall qπ(s a) Eπ [Gt |St = sAt = a]
vπ(s) = Eπ [qπ(s a)|St = s] =sum
aisinA(s)
π(a|s)qπ(s a)
qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]
qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]
Common theme value functions are recursive expand one step to seea useful pattern
Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28
Solving RL Problems
Two main classes of solutions
Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space
Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor
policy functionsI Policy Gradient Methods fall under this category
Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28
Introducing Policy Gradient
Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ
Measure policy performance with scalar performance measure J(θ)
Maximize J(θ) with gradient ascent on θ
General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)
We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards
For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation
Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
Some Potentially Useful Value Identities
Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]
Recall qπ(s a) Eπ [Gt |St = sAt = a]
vπ(s) = Eπ [qπ(s a)|St = s] =sum
aisinA(s)
π(a|s)qπ(s a)
qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]
qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]
Common theme value functions are recursive expand one step to seea useful pattern
Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28
Solving RL Problems
Two main classes of solutions
Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space
Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor
policy functionsI Policy Gradient Methods fall under this category
Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28
Introducing Policy Gradient
Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ
Measure policy performance with scalar performance measure J(θ)
Maximize J(θ) with gradient ascent on θ
General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)
We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards
For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation
Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
Solving RL Problems
Two main classes of solutions
Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space
Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor
policy functionsI Policy Gradient Methods fall under this category
Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28
Introducing Policy Gradient
Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ
Measure policy performance with scalar performance measure J(θ)
Maximize J(θ) with gradient ascent on θ
General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)
We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards
For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation
Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
Introducing Policy Gradient
Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ
Measure policy performance with scalar performance measure J(θ)
Maximize J(θ) with gradient ascent on θ
General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)
We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards
For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation
Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
The Policy Gradient Theorem
J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ
The Policy Gradient Theorem states that
nablaJ(θ) = nablaθvπ(s0) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
micro(s) η(s)sumsprime η(s prime)
the on-policy distribution under π
I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s
Constant of proportionality is the average length of an episode
See p 325 of Sutton RL Book (Second Edition) for concise proof
Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
REINFORCE Update
nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
= Eπ
[suma
qπ(St a)nablaπ(a|St θ)
]
= Eπ
[suma
π(a|St θ)qπ(St a)nablaπ(a|St θ)
π(a|St θ)
]
= Eπ[Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]]= Eπ
[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]
Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28
Reducing REINFORCE Estimator Variance
Recall nablaJ(θ) propsums
micro(s)suma
qπ(s a)nabla(π(a|s θ))
Let b(s) be a baseline (some function of s that does not depend on θ)suma
b(s)nablaπ(a|s θ) = b(s)nablasuma
π(a|s θ) = b(s)nabla1 = 0
Thus nablaJ(θ) propsums
micro(s)suma
(qπ(s a)minus b(s))nabla(π(a|s θ))
New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)
Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w
I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28
REINFORCE with Baseline Algorithm
Two different step sizes αθ and αw
v is also estimated with Monte Carlo
Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28
REINFORCE with Baseline vs without
Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28
Extensions Actor-Critic
Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states
Introduces bias but reduces variance
Can use one-step or n-step return (as opposed to full returns inREINFORCE)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28
One-step Actor-Critic Algorithm
Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28
OpenAI Gym
Free python library of environments for reinforcement learning
Easy to use good for benchmarking
Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics
Check out the docs and environments
Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28
REINFORCE bipedal walker demo
OpenAI Gym Environment
Train a 2D walker to navigate a landscape
Different levels of difficulty
Code taken from Michael Guerzhoyrsquos CSC411 course webpage
Uses policy gradient with REINFORCE
Note requires tensorflow 1x (not 2x)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28
References
Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018
Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)
Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28
- General RL Overview
-
- Agent-Environment Interface
-
- Policy Gradient Methods
-
REINFORCE Update (Continued)
nablaJ(θ) prop Eπ[qπ(St At)
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ [Gt |St At ]
nablaπ(At |St θ)
π(At |St θ)
]= Eπ
[Eπ
[Gtnablaπ(At |St θ)
π(At |St θ)
∣∣∣∣∣St At
]]
= Eπ[Gtnablaπ(At |St θ)
π(At |St θ)
]
Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)
π(At |St θ)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28
Quick Aside Monte Carlo
Way of approximating an integralexpectation
Ep(x) [f (x)] =
intxisinX
p(x)f (x)dx
If p(x) andor f (x) is complicated the integral is difficult to solveanalytically
Simple idea sample N values from the domain of the distribution Xand take the average
Ep(x) [f (x)] asymp 1
N
Nsumi=1
p(x (i))f (x (i))
The larger N is the better the estimate
Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28
REINFORCE Algorithm
Generate entire trajectories and compute Gt from it
Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α
Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28
REINFORCE as a Score Function Gradient Estimator
We can rewrite J(θ) in terms of an expectation over trajectories τ
J(θ) vπ(s0) = Eτ [G0(τ)] =sum
τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ
Recall nablax log f (x) =nablax f (x)
f (x)by chain rule
We can use the log-derivative trick to derive an expression for nablaJ(θ)
In statistics this estimator is called the Score Function Estimator
Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28
REINFORCE as a Score Function Gradient Estimator
nablaJ(θ) = nablasumτ
p(τ |πθ)G0(τ)
=sumτ
nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)
p(τ |πθ)nablap(τ |πθ)G0(τ)
=sumτ
p(τ |πθ)nabla log p(τ |πθ)G0(τ)
= Eτ [nabla log p(τ |πθ)G0(τ)]
nablaJ(θ) =1
N
Nsumi=1
log p(τ (i)|πθ)G0(τ (i))
(Monte Carlo approximation)
Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28
Properties of REINFORCEScore Function GradientEstimator
ProsI Unbiased E
[nablaJ(θ)
]= J(θ)
I Very general (reward does not need to be continuousdifferentiable)
ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient
Reducing REINFORCE Estimator Variance

Recall \[ \nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta) \]
Let b(s) be a baseline (some function of s that does not depend on θ):
\[ \sum_a b(s)\, \nabla \pi(a \mid s, \theta) = b(s)\, \nabla \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla 1 = 0 \]
Thus \[ \nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big( q_\pi(s, a) - b(s) \big)\, \nabla \pi(a \mid s, \theta) \]
New update rule: \[ \theta_{t+1} = \theta_t + \alpha\, \big( G_t - b(S_t) \big)\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \]
Natural choice for b(s): an estimate \(\hat{v}(S_t, w)\) of the state value, parameterized by weights w.
  - We want to adjust θ based on how much better (or worse) the sampled path is than the average return of paths starting from the current state (\(\hat{v}\)).
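A quick numerical check of this claim, reusing the Bernoulli setup from the earlier sketch (my example):

```python
import numpy as np

# Sketch: a baseline leaves the score function estimate unbiased but
# lowers its variance. Constant baseline b = E[f(x)] = 3p + 1.
rng = np.random.default_rng(0)
p = 0.7
b = 3.0 * p + 1.0
x = rng.binomial(1, p, size=(1000, 200))       # 1000 estimates, 200 samples each
f, score = 3.0 * x + 1.0, x - p
plain = (f * score).mean(axis=1)
baselined = ((f - b) * score).mean(axis=1)
print(plain.mean(), baselined.mean())          # both approx 3 * p * (1 - p)
print(plain.var(), baselined.var())            # baselined variance is smaller
```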
REINFORCE with Baseline Algorithm

Two different step sizes, α^θ and α^w.
\(\hat{v}\) is also estimated with Monte Carlo; a sketch of the full loop is given below.
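A hedged sketch of the full algorithm for a tabular softmax policy (my illustration, not the slide's exact pseudocode; the `env` interface with reset() -> s and step(a) -> (s_next, r, done) is an assumption, and softmax() is the helper from the earlier sketch):

```python
import numpy as np

def reinforce_with_baseline(env, n_states, n_actions, episodes=1000,
                            gamma=0.99, alpha_theta=0.01, alpha_w=0.1):
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))    # policy parameters
    w = np.zeros(n_states)                     # baseline v_hat(s, w)
    for _ in range(episodes):
        # generate one full trajectory under the current policy
        s, traj, done = env.reset(), [], False
        while not done:
            a = rng.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
        # Monte Carlo returns, then baselined policy/value updates
        G = 0.0
        for s, a, r in reversed(traj):
            G = r + gamma * G
            delta = G - w[s]                   # G_t - b(S_t)
            w[s] += alpha_w * delta            # MC update of v_hat
            grad_log_pi = -softmax(theta[s])
            grad_log_pi[a] += 1.0
            theta[s] += alpha_theta * delta * grad_log_pi
    return theta, w
```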
REINFORCE with Baseline vs without
Extensions: Actor-Critic

Use bootstrapping to update the state-value estimate \(\hat{v}(s, w)\) for a state s from the estimated values of subsequent states.
Introduces bias, but reduces variance.
Can use the one-step or n-step return (as opposed to the full returns in REINFORCE); see the sketch below.
One-step Actor-Critic Algorithm
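A hedged tabular sketch of one-step actor-critic, under the same assumed env interface and softmax() helper as the previous sketch (my illustration, not the slide's exact pseudocode):

```python
import numpy as np

def one_step_actor_critic(env, n_states, n_actions, episodes=1000,
                          gamma=0.99, alpha_theta=0.01, alpha_w=0.1):
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))    # actor: policy parameters
    w = np.zeros(n_states)                     # critic: v_hat(s, w)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            probs = softmax(theta[s])
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            # TD target bootstraps from v_hat(s') instead of waiting for G_t
            target = r + (0.0 if done else gamma * w[s_next])
            delta = target - w[s]              # one-step TD error
            w[s] += alpha_w * delta
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            theta[s] += alpha_theta * delta * grad_log_pi
            s = s_next
    return theta, w
```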
OpenAI Gym

Free Python library of environments for reinforcement learning.
Easy to use, good for benchmarking.
Environments include:
  - Classic control problems
  - Atari games
  - 3D simulations (MuJoCo)
  - Robotics
Check out the docs and environments; a minimal usage example is below.
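A minimal usage sketch under the 2019-era Gym API, where step() returns the (obs, reward, done, info) tuple:

```python
import gym  # classic Gym API, circa 2019 (gym < 0.26)

# Sketch: run one episode of CartPole with a random policy.
env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # sample a random action
    obs, reward, done, info = env.step(action)  # advance the environment
    total_reward += reward
env.close()
print(total_reward)
```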
REINFORCE bipedal walker demo

OpenAI Gym environment.
Train a 2D walker to navigate a landscape.
Different levels of difficulty.
Code taken from Michael Guerzhoy's CSC411 course webpage.
Uses policy gradient with REINFORCE.
Note: requires TensorFlow 1.x (not 2.x).
References

Sutton, R. S. and Barto, A. G., "Reinforcement Learning: An Introduction" (Second Edition), 2018.
Xuchan (Jenny) Bao, "Intro to RL and Policy Gradient", CSC421 Tutorial, 2019 (link).
Michael Guerzhoy, "CSC411 Project 4: Reinforcement Learning with Policy Gradients", 2017 (link).