Lecture 8: Policy Gradient
Hado van Hasselt
Outline
1 Introduction
2 Finite Difference Policy Gradient
3 Monte-Carlo Policy Gradient
4 Actor-Critic Policy Gradient
Introduction
Vapnik’s rule
Never solve a more general problem as an intermediate step. (Vladimir Vapnik, 1998)
If we care about optimal behaviour: why not learn a policy directly?
General overview
Model-based RL:
+ ‘Easy’ to learn a model (supervised learning)
- ‘Wrong’ objective
- Non-trivial going from model to policy (planning)
Value-based RL:
+ Closer to the true objective
o Harder than learning models, but perhaps easier than policies
- Still not exactly the right objective
Policy-based RL:
+ Right objective!
- Learning & evaluating policies can be inefficient (high variance)
- Ignores other learnable knowledge, which may be important
Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ,
vθ(s) ≈ vπ(s)
qθ(s, a) ≈ qπ(s, a)
A policy was generated directly from the value function
e.g. using ε-greedy
In this lecture we will directly parametrize the policy
πθ(a|s) = P [a | s, θ]
We focus on model-free reinforcement learning
Value-Based and Policy-Based RL
Value-Based: learnt value function, implicit policy (e.g. ε-greedy)
Policy-Based: no value function, learnt policy
Actor-Critic: learnt value function and learnt policy
Advantages of Policy-Based RL
Advantages:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Sometimes policies are simple while values and models are complex
Disadvantages:
Susceptible to local optima
Learning a policy can be slower than learning values
Example: Rock-Paper-Scissors
Two-player game of rock-paper-scissors
Scissors beats paper
Rock beats scissors
Paper beats rock
Consider policies for iterated rock-paper-scissors
A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. Nash equilibrium)
Example: Aliased Gridworld (1)
The agent cannot differentiate the grey states
Consider features of the following form (for all N, E, S, W)
φ(s, a) = ( 1, 0, 1, 0 | 0, 1, 0, 0 )
where the first four entries indicate walls to the N, E, S, W and the last four indicate the action N, E, S, W (here: walls to the N and S, action ‘move E’)
Compare deterministic and stochastic policies
Example: Aliased Gridworld (2)
Under aliasing, an optimal deterministic policy will either
move W in both grey states (shown by red arrows)
move E in both grey states
Either way, it can get stuck and never reach the money
So it will traverse the corridor for a long time
Example: Aliased Gridworld (3)
An optimal stochastic policy moves randomly E or W in grey states
πθ(wall to N and S, move E) = 0.5
πθ(wall to N and S, move W) = 0.5
Will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy
Policy Search
Policy Objective Functions
Goal: given policy πθ(s, a) with parameters θ, find best θ
But how do we measure the quality of a policy πθ?
In episodic environments we can use the start value
J1(θ) = vπθ(s1)
In continuing environments we can use the average value
JavV(θ) = ∑s dπθ(s) vπθ(s)
Or the average reward per time-step
JavR(θ) = ∑s dπθ(s) ∑a πθ(s, a) R(s, a)
where dπθ(s) is the stationary distribution of the Markov chain for πθ
Policy Optimisation
Policy-based reinforcement learning is an optimisation problem
Find θ that maximises J(θ)
Some approaches do not use gradient
Hill climbing
Genetic algorithms
Greater efficiency often possible using gradient
Gradient descent
Conjugate gradient
Quasi-Newton
We will focus on stochastic gradient ascent
Finite Difference Policy Gradient
Policy Gradient
Let J(θ) be any policy objective function
Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ
∆θ = α∇θJ(θ)
Where ∇θJ(θ) is the policy gradient
∇θJ(θ) = ( ∂J(θ)/∂θ1 , . . . , ∂J(θ)/∂θn )ᵀ
and α is a step-size parameter
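This section's title refers to the simplest way of estimating that gradient: finite differences, i.e. perturbing each parameter in turn and measuring the change in the objective. A minimal sketch, assuming a (hypothetical) routine evaluate_policy(theta) that returns a possibly noisy estimate of J(θ), e.g. the average return over a few rollouts:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter separately.

    evaluate_policy(theta) is assumed to return an estimate of J(theta).
    """
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        delta = np.zeros_like(theta)
        delta[k] = eps
        # Central difference estimate of the k-th partial derivative of J
        grad[k] = (evaluate_policy(theta + delta)
                   - evaluate_policy(theta - delta)) / (2 * eps)
    return grad
```

This works even for non-differentiable policies, but it needs 2n policy evaluations per gradient estimate and the estimates are noisy.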
[Figure: gradient descent with a critic. The critic returns an error which, in combination with the function approximator, can be used to create an error function. The partial differential of this error function (the gradient) can then be used to update the internal variables in the function approximator (and critic).]
Monte-Carlo Policy Gradient
Likelihood Ratios
Gradients on parameterized policies
We need to compute an estimate of the policy gradient
Assume policy πθ is differentiable almost everywhere
E.g., πθ is a linear function, or a neural network
Or perhaps we may have a parameterized class of controllers
Goal is to compute
∇θJ(θ) = ∇θEd [vπθ(S)] .
We will use Monte Carlo samples to compute this gradient
So, how does Ed [vπθ(S)] depend on θ?
Monte Carlo Policy Gradient
Consider a one-step case such that J(θ) = E[R(S, A)], where the expectation is over d (states) and π (actions)
We cannot sample Rt+1 and then take a gradient: Rt+1 is just a number that does not depend on θ
Instead, we use the identity:
∇θE[R(S ,A)] = E[∇θ log π(A|S)R(S ,A)] .
(Proof below)
The right-hand side gives an expected gradient that can be sampled
∇θE[R(S, A)] = ∇θ ∑s d(s) ∑a πθ(a|s) R(s, a)
= ∑s d(s) ∑a ∇θπθ(a|s) R(s, a)
= ∑s d(s) ∑a πθ(a|s) (∇θπθ(a|s) / πθ(a|s)) R(s, a)
= ∑s d(s) ∑a πθ(a|s) ∇θ log πθ(a|s) R(s, a)
= E[∇θ log πθ(A|S) R(S, A)]
∇θE[R(S, A)] = E[∇θ log πθ(A|S) R(S, A)] (derived above)
This is something we can sample
Our stochastic policy-gradient update is then
θt+1 = θt + αRt+1∇θ log πθt (At |St) .
In expectation, this is the actual policy gradient
So this is a stochastic gradient algorithm
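As a concrete illustration (not from the slides), here is this update for a softmax policy over action preferences in a made-up three-armed bandit, where each episode is a single step so the sampled reward plays the role of Rt+1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])   # made-up expected rewards per action

theta = np.zeros(3)   # action preferences
alpha = 0.1           # step size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 1.0)   # sampled reward R_{t+1}
    grad_log_pi = -pi                    # d log pi(a) / d theta_b = -pi(b) for b != a
    grad_log_pi[a] += 1.0                # ... and 1 - pi(a) for b == a
    theta += alpha * r * grad_log_pi     # theta <- theta + alpha * R * grad log pi

print(softmax(theta))   # most probability mass should end up on the best arm
```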
Softmax Policy
Consider a softmax policy on linear values as an example
Weight actions using a linear combination of features φ(s, a)ᵀθ
Probability of action is proportional to exponentiated weight
πθ(s, a) = exp(φ(s, a)ᵀθ) / ∑b exp(φ(s, b)ᵀθ)
The gradient of the log probability is
∇θ log πθ(s, a) = φ(s, a)− Eπθ [φ(s, ·)]
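A quick numerical check of this identity for a linear softmax policy; the feature values and dimensions below are made up purely for illustration:

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi has shape [num_actions, num_features], with rows phi(s, a)."""
    prefs = phi @ theta
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_softmax(theta, phi, a):
    """grad_theta log pi(a|s) = phi(s, a) - E_pi[phi(s, .)]"""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 3))      # 4 actions, 3 features (illustrative)
theta = rng.normal(size=3)
a, eps = 2, 1e-6
numeric = np.array([
    (np.log(softmax_policy(theta + eps * np.eye(3)[i], phi)[a])
     - np.log(softmax_policy(theta - eps * np.eye(3)[i], phi)[a])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(numeric, grad_log_softmax(theta, phi, a), atol=1e-5)
```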
Gaussian Policy
In continuous action spaces, a Gaussian policy is common
E.g., the mean is a linear combination of state features: µ(s) = φ(s)ᵀθ
Let's assume the variance is fixed at σ²
(it can be parametrized as well, instead)
Policy is Gaussian, a ∼ N (µ(s), σ2)
The gradient of the log of the policy is then
∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²
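A corresponding sketch for sampling an action from a linear-Gaussian policy and computing its score; the feature values are illustrative:

```python
import numpy as np

def gaussian_sample_and_score(theta, phi_s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta,
    and return the action together with grad_theta log pi(a|s)."""
    mu = phi_s @ theta
    a = rng.normal(mu, sigma)
    score = (a - mu) * phi_s / sigma**2   # (a - mu(s)) phi(s) / sigma^2
    return a, score

rng = np.random.default_rng(0)
phi_s = np.array([1.0, 0.5, -2.0])        # illustrative state features
a, score = gaussian_sample_and_score(np.zeros(3), phi_s, sigma=1.0, rng=rng)
```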
Policy Gradient Theorem
The policy gradient theorem generalizes this approach to multi-step MDPs
Replaces instantaneous reward R with long-term value qπ(s, a)
Policy gradient theorem applies to start state objective,average reward and average value objective
Theorem
For any differentiable policy πθ(s, a), for any of the policy objective functions J = J1, JavR, or (1/(1−γ)) JavV, the policy gradient is
∇θJ(θ) = Eπθ [∇θ log πθ(A|S) qπθ(S, A)]
Monte-Carlo Policy Gradient (REINFORCE)
Using return Gt = Rt+1 + γRt+2 + γ²Rt+3 + . . . as an unbiased sample of qπθ(St, At)
Update parameters by stochastic gradient ascent
∆θt = α∇θ log πθ(At |St)Gt
function REINFORCE
  Initialise θ arbitrarily
  for each episode {S1, A1, R2, ..., ST−1, AT−1, RT} ∼ πθ do
    for t = 1 to T − 1 do
      θ ← θ + α ∇θ log πθ(At|St) Gt
    end for
  end for
  return θ
end function
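A compact Python sketch of the same algorithm with a linear softmax policy. The environment interface (reset() returning a state, step(a) returning (next state, reward, done)) and the feature map phi(s, a) are assumptions made for illustration, not part of the slides:

```python
import numpy as np

def reinforce(env, phi, num_features, num_actions,
              episodes=1000, alpha=0.01, gamma=0.99, seed=0):
    """Monte-Carlo policy gradient (REINFORCE) with a linear softmax policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(episodes):
        # Generate one episode following pi_theta
        s, done, trajectory = env.reset(), False, []
        while not done:
            a = rng.choice(num_actions, p=policy(s))
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # Walk backwards to accumulate returns G_t, updating theta at every step
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            pi = policy(s)
            grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(num_actions))
            theta += alpha * grad_log * G
    return theta
```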
Labyrinth
Actor-Critic Policy Gradient
Reducing Variance Using a Critic
Monte-Carlo policy gradient can have high variance, because we use the full return
We can use a critic to estimate the action-value function,
qw (s, a) ≈ qπθ(s, a)
Actor-critic algorithms maintain two sets of parameters
Critic: updates action-value function parameters w
Actor: updates policy parameters θ, in the direction suggested by the critic
One actor-critic algorithm: follow an approximate policy gradient
∇θJ(θ) ≈ Eπθ [∇θ log πθ(s, a) qw (s, a)]
∆θ = α∇θ log πθ(s, a) qw (s, a)
Estimating the Action-Value Function
The critic is solving a familiar problem: policy evaluation
What is the value of policy πθ for current parameters θ?
This problem was explored in previous lectures, e.g.
Monte-Carlo policy evaluation
Temporal-Difference learning
TD(λ)
Action-Value Actor-Critic
Simple actor-critic algorithm based on action-value critic
Using linear value function approximation qw(s, a) = φ(s, a)ᵀw
Critic: updates w by linear Sarsa(0)
Actor: updates θ by policy gradient
function QAC
  Initialise S, θ
  Sample A ∼ πθ(·|S)
  for each step do
    Sample reward R = R(S, A); sample transition S′ ∼ P(·|S, A)
    Sample action A′ ∼ πθ(·|S′)
    w ← w + β (R + γ qw(S′, A′) − qw(S, A)) φ(S, A)   [Sarsa(0)]
    θ ← θ + α ∇θ log πθ(A|S) qw(S, A)   [policy gradient update]
    A ← A′, S ← S′
  end for
end function
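A corresponding Python sketch of QAC with a softmax actor and a linear Sarsa(0) critic; as with the REINFORCE sketch above, the environment interface and feature map phi(s, a) are assumed for illustration:

```python
import numpy as np

def qac(env, phi, num_features, num_actions,
        steps=50_000, alpha=0.01, beta=0.1, gamma=0.99, seed=0):
    """Action-value actor-critic: softmax actor (theta), linear Sarsa(0) critic (w)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)   # actor parameters
    w = np.zeros(num_features)       # critic parameters

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    s = env.reset()
    a = rng.choice(num_actions, p=policy(s))
    for _ in range(steps):
        s2, r, done = env.step(a)
        if done:
            td_target = r                          # no bootstrap at terminal states
            s2 = env.reset()
            a2 = rng.choice(num_actions, p=policy(s2))
        else:
            a2 = rng.choice(num_actions, p=policy(s2))
            td_target = r + gamma * (phi(s2, a2) @ w)
        q_sa = phi(s, a) @ w
        w += beta * (td_target - q_sa) * phi(s, a)              # critic: linear Sarsa(0)
        pi = policy(s)
        grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(num_actions))
        theta += alpha * q_sa * grad_log                        # actor: policy gradient step
        s, a = s2, a2
    return theta, w
```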
Compatible Function Approximation
Bias in Actor-Critic Algorithms
Approximating the policy gradient introduces bias
A biased policy gradient may not find the right solution
e.g. if qw(s, a) uses aliased features, can we solve the gridworld example?
Luckily, if we choose value function approximation carefully:
we can avoid introducing any bias
i.e. we can still follow the exact policy gradient
Compatible Function Approximation
Theorem (Compatible Function Approximation Theorem)
If the following two conditions are satisfied:
1 Value function approximator is compatible to the policy
∇wqw (s, a) = ∇θ log πθ(s, a)
2 Value function parameters w minimise the mean-squared error
ε = Eπθ[(qπθ(S, A) − qw(S, A))²]
Then the policy gradient is exact,
∇θJ(θ) = Eπθ[∇θ log πθ(A|S) qw(S, A)]
Proof of Compatible Function Approximation Theorem
If w is chosen to minimise mean-squared error, the gradient of ε w.r.t. w must be zero,
∇wε = 0
Eπθ[(qπθ(S, A) − qw(S, A)) ∇w qw(S, A)] = 0
Eπθ[(qπθ(S, A) − qw(S, A)) ∇θ log πθ(A|S)] = 0
Eπθ[qπθ(S, A) ∇θ log πθ(A|S)] = Eπθ[qw(S, A) ∇θ log πθ(A|S)]
So qw (s, a) can be substituted directly into the policy gradient,
∇θJ(θ) = Eπθ [∇θ log πθ(s, a)qw (s, a)]
Compatible Function Approximation
1 Value function approximator is compatible to the policy
∇wqw (s, a) = ∇θ log πθ(s, a)
2 Value function parameters w minimise the mean-squared error
ε = Eπθ[(qπθ(S, A) − qw(S, A))²]
Unfortunately, 1) only holds for certain function classes
Unfortunately, 2) is never quite true
These are typical limitations of theory; be aware of the assumptions
Advantage Function Critic
Reducing Variance Using a Baseline
We subtract a baseline function b(s) from the policy gradient
This can reduce variance, without changing expectation
Eπθ [∇θ log πθ(s, a) b(s)] = ∑s∈S dπθ(s) ∑a ∇θπθ(s, a) b(s)
= ∑s∈S dπθ(s) b(s) ∇θ ∑a∈A πθ(s, a)
= 0
So we can rewrite the policy gradient as
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) (qπθ(s, a)− b(s))]
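One way to see the effect is to compare the empirical mean and variance of the gradient estimator with and without a baseline on a toy problem; the bandit rewards (with a large common offset) and the choice b = expected reward under π are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([10.0, 11.0, 12.0])   # rewards with a large common offset (illustrative)

theta = np.zeros(3)
pi = np.exp(theta - theta.max()); pi /= pi.sum()
baseline = pi @ true_means                  # one possible baseline b(s): E_pi[R]

def grad_sample(use_baseline):
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 1.0)
    score = -pi.copy()
    score[a] += 1.0                          # grad_theta log pi(a)
    return score * ((r - baseline) if use_baseline else r)

for use_baseline in (False, True):
    g = np.array([grad_sample(use_baseline) for _ in range(10_000)])
    print("baseline" if use_baseline else "no baseline",
          "mean:", g.mean(axis=0).round(2), "variance:", g.var(axis=0).round(2))
```

Both estimators have (approximately) the same mean, the true gradient, but the baselined one has a much smaller variance.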
Estimating the Advantage Function (1)
The advantage function can significantly reduce the variance of the policy gradient
So the critic should really estimate the advantage function
For example, by estimating both vπθ(s) and qπθ(s, a)
Using two function approximators and two parameter vectors,
vξ(s) ≈ vπθ(s)
qw (s, a) ≈ qπθ(s, a)
A(s, a) = qw (s, a)− vξ(s)
And updating both value functions by e.g. TD learning
Estimating the Advantage Function (2)
For the true value function vπθ(s), the TD error δπθ
δπθ = r + γvπθ(s ′)− vπθ(s)
is an unbiased estimate of the advantage function
Eπθ [δπθ |s, a] = Eπθ [Rt+1 + γvπθ(St+1)|s, a]− vπθ(s)
= qπθ(s, a)− vπθ(s)
= Aπθ(s, a)
So we can use the TD error to compute the policy gradient
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) δπθ ]
In practice we can use an approximate TD error
δv = r + γvξ(s′)− vξ(s)
This approach only requires one set of critic parameters ξ
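A minimal sketch of one such actor-critic update, using the TD error as the advantage estimate; the argument names, feature maps and shapes are illustrative:

```python
import numpy as np

def td_actor_critic_step(theta, xi, phi_s, phi_s_next, phi_sa, pi_s, a, r,
                         alpha=0.01, beta=0.1, gamma=0.99, done=False):
    """One update of a softmax actor (theta) and a linear state-value critic (xi).

    phi_s, phi_s_next: state features; phi_sa: per-action features [num_actions, d];
    pi_s: current action probabilities in s; a, r: sampled action and reward.
    """
    v_s = phi_s @ xi
    v_s_next = 0.0 if done else phi_s_next @ xi
    delta = r + gamma * v_s_next - v_s            # TD error ~ advantage A(s, a)
    xi = xi + beta * delta * phi_s                # critic: TD(0) update
    grad_log = phi_sa[a] - pi_s @ phi_sa          # softmax score grad log pi(a|s)
    theta = theta + alpha * delta * grad_log      # actor: policy gradient with delta
    return theta, xi, delta
```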
Eligibility Traces
Critics at Different Time-Scales
The critic can estimate the value function vξ(s) from many targets at different time-scales (from the last lecture...)
For MC, the target is the return Gt
∆ξ = α(Gt − vξ(St))φ(St)
For TD(0), the target is the TD target Rt+1 + γvξ(St+1)
∆ξ = α(Rt+1 + γvξ(St+1) − vξ(St))φ(St)
For forward-view TD(λ), the target is the λ-return Gλt
∆ξ = α(Gλt − vξ(St))φ(St)
For backward-view TD(λ), we use eligibility traces
δt = Rt+1 + γvξ(St+1) − vξ(St)
et = γλ et−1 + φ(St)
∆ξ = α δt et
Actors at Different Time-Scales
The policy gradient can also be estimated at many time-scales
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) Aπθ(s, a)]
Monte-Carlo policy gradient uses error from complete return
∆θ = α(Gt − vξ(st))∇θ log πθ(st , at)
Actor-critic policy gradient uses the one-step TD error
∆θ = α(Rt+1 + γvξ(st+1)− vξ(st))∇θ log πθ(st , at)
Policy Gradient with Eligibility Traces
Just like forward-view TD(λ), we can mix over time-scales
∆θ = α(Gλt − vξ(st)) ∇θ log πθ(st, at)
Like backward-view TD(λ), we can also use eligibility traces
By equivalence with TD(λ), substituting φ(s) = ∇θ log πθ(s, a)
δt = Rt+1 + γvξ(St+1) − vξ(St)
et = γλ et−1 + ∇θ log πθ(At|St)
∆θ = α δt et
This update can be applied online, to incomplete sequences
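A sketch of one such online update, keeping one eligibility trace for the critic (accumulating state features) and one for the actor (accumulating score vectors); names and shapes are again illustrative:

```python
import numpy as np

def actor_critic_traces_step(theta, xi, e_theta, e_xi, phi_s, phi_s_next, score, r,
                             alpha=0.01, beta=0.1, gamma=0.99, lam=0.9, done=False):
    """One online actor-critic update with eligibility traces.

    score = grad_theta log pi(a|s); e_theta, e_xi are the actor and critic traces.
    """
    v_next = 0.0 if done else phi_s_next @ xi
    delta = r + gamma * v_next - phi_s @ xi       # TD error
    e_xi = gamma * lam * e_xi + phi_s             # critic trace
    xi = xi + beta * delta * e_xi
    e_theta = gamma * lam * e_theta + score       # actor trace
    theta = theta + alpha * delta * e_theta
    return theta, xi, e_theta, e_xi
```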
Natural Policy Gradient
Alternative Policy Gradient Directions
Gradient ascent algorithms can follow any ascent direction
A good ascent direction can significantly speed convergence
Also, a policy can often be reparametrised without changing action probabilities
For example, increasing score of all actions in a softmax policy
The vanilla gradient is sensitive to these reparametrisations
Natural Policy Gradient
The natural policy gradient is parametrisation independent
It finds the direction that maximally ascends the objective function, when changing the policy by a small, fixed amount
∇natθ πθ(s, a) = Gθ⁻¹ ∇θπθ(s, a)
where Gθ is the Fisher information matrix
Gθ = Eπθ[∇θ log πθ(s, a) ∇θ log πθ(s, a)ᵀ]
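A sketch of how this can be used in practice: estimate the Fisher matrix from sampled score vectors under the current policy, then solve a linear system to turn the vanilla gradient into the natural gradient. The small ridge term is an added assumption to keep the empirical Gθ invertible:

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, reg=1e-3):
    """Return an estimate of G^{-1} grad J.

    score_samples: array [N, d] of grad_theta log pi(A|S) samples drawn under pi_theta.
    vanilla_grad:  an estimate of the ordinary policy gradient (length d).
    """
    G = score_samples.T @ score_samples / len(score_samples)   # empirical Fisher matrix
    return np.linalg.solve(G + reg * np.eye(G.shape[0]), vanilla_grad)
```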
Natural Actor-Critic
Using compatible linear function approximation,
φ(s, a) = ∇θ log πθ(s, a)
So the natural policy gradient simplifies,
∇θJ(θ) = Eπθ[∇θ log πθ(s, a) Aπθ(s, a)]
= Eπθ[∇θ log πθ(s, a) φ(s, a)ᵀw]
= Eπθ[∇θ log πθ(s, a) ∇θ log πθ(s, a)ᵀ] w
= Gθ w
∇natθ J(θ) = w
i.e. update actor parameters in direction of critic parameters
Natural Actor Critic in Snake Domain
The dynamics of the CPG are given by
φ(t + 1) = φ(t) + ω(t)
where ω is the input (corresponding to an angular velocity) to the CPG, either a constant input (ω0) or an impulsive input (ω0 + Δω), where Δω is fixed at π/15. The snake-like robot controlled by this CPG goes straight while ω is ω0, but turns when an impulsive input is added at a certain instance; the turning angle depends on both the CPG's internal state φ and the intensity of the impulse. An action is to select an input out of these two candidates, and the controller represents the action selection probability, which is adjusted by RL.
To carry out RL with the CPG, we employ the CPG-actor-critic model [24], in which the physical system (robot) and the CPG are regarded as a single dynamical system, called the CPG-coupled system, and the control signal for the CPG-coupled system is optimized by RL.
[Figure 2: Crank course and sensor setting — (a) crank course, (b) sensor setting]
4.1 Experimental settings
We here describe the settings for the experiments, which examine the efficiency of sample reuse and the exploration ability of our new RL method, off-NAC.
It is assumed the sensor detects the distance d between the sensor point (red circle in Fig. 2(b)) and the nearest wall, and the angle θ between the proceeding direction and the nearest wall, as depicted in Fig. 2(b). The sensor point is the projection of the head link's position onto the body axis of the snake-like robot, which passes through the center of mass and points in the proceeding direction. The angular sensor perceives the wall angle within [0, 2π) [rad], measured anti-clockwise, and θ = 0 indicates that the snake-like robot hits the wall.
Natural Actor Critic in Snake Domain (2)
With the policy parameter before learning, θbi, the robot soon hit the wall and got stuck (Fig. 3(a)). In contrast, the policy parameter after learning, θai, allowed the robot to pass through the crank course repeatedly (Fig. 3(b)). This result shows a good controller was acquired by our off-NAC.
[Figure 3: Behaviors of snake-like robot — (a) before learning, (b) after learning; trajectories shown as position x (m) versus position y (m).]
Figure 4 shows the action a (accompanied by the number of aggregated impulses), the internal state of the CPG, cos(φ), the relative angle of the wall θ̄ (for visibility, the range θ ∈ [π, 2π) was moved to θ̄ ∈ [−π, 0); if θ̄ is negative (positive), the wall is on the right-hand side (left-hand side) of the robot), and the distance between the robot and the wall, d. When the wall was on the right-hand side, the impulsive input was applied (a = 1) when cos(φ) was large; this control was appropriate for making the robot turn left. In addition, the impulsive input was added many times, especially when the distance d became small.
We compare the performance of our off-NAC with that of the on-NAC [1] [2]. We prepared ten initial policy parameters randomly drawn from [−10, 0]⁴, and they were adjusted by the two RL methods. Learning parameters, such as the learning coefficient, were best tuned by hand. To examine solely the efficiency of the reuse of past trajectories here, we did not use any exploratory behavior policies (exploration parameter set to 0), i.e., trajectories were generated according to the current policy in each episode. The maximum number of stored sample sequences (K) was set at 10, and the number of initially-stored sample sequences (K0) was set at 0. Eligibility λ was set at 0.999. We regarded a learning run as successful when the actor was improved so as to allow the robot to pass through the crank course twice without any collision with the wall, after 500 learning episodes.
Table 1 shows the number of successful learning runs and the average number of episodes until the robot was first successful in passing through the crank course. With respect to both of these criteria, our off-NAC clearly outperformed the on-NAC.
Figure 5 shows a typical learning run by the off-NAC (Fig. 5(1a-c)) and the best run by the on-NAC (Fig. 5(2a-c)). Panels (a), (b) and (c) show the average reward in a single
Summary of Policy Gradient Algorithms
The policy gradient has many equivalent forms
∇θJ(θ) = Eπθ[∇θ log πθ(A|S) Gt ] REINFORCE
= Eπθ[∇θ log πθ(A|S) qw (S ,A)] Q Actor-Critic
= Eπθ[∇θ log πθ(A|S) Aw (S ,A)] Advantage Actor-Critic
= Eπθ[∇θ log πθ(A|S) δt ] TD Actor-Critic
= Eπθ[δtet ] TD(λ) Actor-Critic
Gθ⁻¹ ∇θJ(θ) = w (linear) Natural Actor-Critic
Each leads to a stochastic gradient ascent algorithm
Critic uses policy evaluation (e.g. MC or TD learning) to estimate qπ(s, a), Aπ(s, a) or vπ(s)