Lecture 8: Policy Gradient
Hado van Hasselt
Outline
1 Introduction
2 Finite Difference Policy Gradient
3 Monte-Carlo Policy Gradient
4 Actor-Critic Policy Gradient
Introduction
Vapnik’s rule
Never solve a more general problem as an intermediate step. (Vladimir Vapnik, 1998)
If we care about optimal behaviour: why not learn a policy directly?
General overview
Model-based RL:
+ ‘Easy’ to learn a model (supervised learning)
- ‘Wrong’ objective
- Non-trivial going from model to policy (planning)
Value-based RL:
+ Closer to the true objective
o Harder than learning models, but perhaps easier than policies
- Still not exactly the right objective
Policy-based RL:
+ Right objective!
- Learning & evaluating policies can be inefficient (high variance)
- Ignores other learnable knowledge, which may be important
Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ,
vθ(s) ≈ vπ(s)
qθ(s, a) ≈ qπ(s, a)
A policy was generated directly from the value function
e.g. using ε-greedy
In this lecture we will directly parametrize the policy
πθ(a|s) = P [a | s, θ]
We focus on model-free reinforcement learning
Value-Based and Policy-Based RL
Value-Based: learnt value function, implicit policy (e.g. ε-greedy)
Policy-Based: no value function, learnt policy
Actor-Critic: learnt value function and learnt policy
Advantages of Policy-Based RL
Advantages:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Sometimes policies are simple while values and models are complex
Disadvantages:
Susceptible to local optima
Learning a policy can be slower than learning values
Example: Rock-Paper-Scissors
Two-player game of rock-paper-scissors
Scissors beats paper
Rock beats scissors
Paper beats rock
Consider policies for iterated rock-paper-scissors
A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. Nash equilibrium)
Example: Aliased Gridworld (1)
The agent cannot differentiate the grey states
Consider features of the following form (for all N, E, S, W)
φ(s, a) = ( 1, 0, 1, 0 | 0, 1, 0, 0 )
where the first four entries indicate walls to the N, E, S, W and the last four indicate the action N, E, S, W (here: walls to the N and S, action ‘move E’)
Compare deterministic and stochastic policies
Example: Aliased Gridworld (2)
Under aliasing, an optimal deterministic policy will either
move W in both grey states (shown by red arrows)
move E in both grey states
Either way, it can get stuck and never reach the money
So it will traverse the corridor for a long time
Example: Aliased Gridworld (3)
An optimal stochastic policy moves randomly E or W in grey states
πθ(wall to N and S, move E) = 0.5
πθ(wall to N and S, move W) = 0.5
Will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy
Policy Search
Policy Objective Functions
Goal: given policy πθ(s, a) with parameters θ, find best θ
But how do we measure the quality of a policy πθ?
In episodic environments we can use the start value
J1(θ) = vπθ(s1)
In continuing environments we can use the average value
JavV(θ) = ∑s dπθ(s) vπθ(s)
Or the average reward per time-step
JavR(θ) = ∑s dπθ(s) ∑a πθ(s, a) R(s, a)
where dπθ(s) is the stationary distribution of the Markov chain for πθ
Policy Optimisation
Policy-based reinforcement learning is an optimisation problem
Find θ that maximises J(θ)
Some approaches do not use gradient
Hill climbing
Genetic algorithms
Greater efficiency often possible using gradient
Gradient descent
Conjugate gradient
Quasi-Newton
We will focus on stochastic gradient ascent
Finite Difference Policy Gradient
Policy Gradient
Let J(θ) be any policy objective function
Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ
∆θ = α∇θJ(θ)
Where ∇θJ(θ) is the policy gradient
∇θJ(θ) = ( ∂J(θ)/∂θ1 , . . . , ∂J(θ)/∂θn )ᵀ
and α is a step-size parameter
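This section's title refers to the simplest way of estimating that gradient: finite differences, i.e. perturbing each parameter in turn and measuring the change in the objective. A minimal sketch, assuming a (hypothetical) routine evaluate_policy(theta) that returns a possibly noisy estimate of J(θ), e.g. the average return over a few rollouts:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter separately.

    evaluate_policy(theta) is assumed to return an estimate of J(theta).
    """
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        delta = np.zeros_like(theta)
        delta[k] = eps
        # Central difference estimate of the k-th partial derivative of J
        grad[k] = (evaluate_policy(theta + delta)
                   - evaluate_policy(theta - delta)) / (2 * eps)
    return grad
```

This works even for non-differentiable policies, but it needs 2n policy evaluations per gradient estimate and the estimates are noisy.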
[Figure: gradient descent with a critic. The critic returns an error which, in combination with the function approximator, can be used to create an error function. The partial differential of this error function (the gradient) can then be used to update the internal variables in the function approximator (and critic).]
Monte-Carlo Policy Gradient
Likelihood Ratios
Gradients on parameterized policies
We need to compute an estimate of the policy gradient
Assume policy πθ is differentiable almost everywhere
E.g., πθ is a linear function, or a neural network
Or perhaps we may have a parameterized class of controllers
Goal is to compute
∇θJ(θ) = ∇θEd [vπθ(S)] .
We will use Monte Carlo samples to compute this gradient
So, how does Ed [vπθ(S)] depend on θ?
Monte Carlo Policy Gradient
Consider a one-step case such that J(θ) = E[R(S, A)], where the expectation is over d (states) and π (actions)
We cannot sample Rt+1 and then take a gradient: Rt+1 is just a number that does not depend on θ
Instead, we use the identity:
∇θE[R(S ,A)] = E[∇θ log π(A|S)R(S ,A)] .
(Proof below)
The right-hand side gives an expected gradient that can be sampled
∇θE[R(S, A)] = ∇θ ∑s d(s) ∑a πθ(a|s) R(s, a)
= ∑s d(s) ∑a ∇θπθ(a|s) R(s, a)
= ∑s d(s) ∑a πθ(a|s) (∇θπθ(a|s) / πθ(a|s)) R(s, a)
= ∑s d(s) ∑a πθ(a|s) ∇θ log πθ(a|s) R(s, a)
= E[∇θ log πθ(A|S) R(S, A)]
∇θE[R(S, A)] = E[∇θ log πθ(A|S) R(S, A)] (derived above)
This is something we can sample
Our stochastic policy-gradient update is then
θt+1 = θt + αRt+1∇θ log πθt (At |St) .
In expectation, this is the actual policy gradient
So this is a stochastic gradient algorithm
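As a concrete illustration (not from the slides), here is this update for a softmax policy over action preferences in a made-up three-armed bandit, where each episode is a single step so the sampled reward plays the role of Rt+1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])   # made-up expected rewards per action

theta = np.zeros(3)   # action preferences
alpha = 0.1           # step size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 1.0)   # sampled reward R_{t+1}
    grad_log_pi = -pi                    # d log pi(a) / d theta_b = -pi(b) for b != a
    grad_log_pi[a] += 1.0                # ... and 1 - pi(a) for b == a
    theta += alpha * r * grad_log_pi     # theta <- theta + alpha * R * grad log pi

print(softmax(theta))   # most probability mass should end up on the best arm
```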
Softmax Policy
Consider a softmax policy on linear values as an example
Weight actions using a linear combination of features φ(s, a)ᵀθ
Probability of action is proportional to exponentiated weight
πθ(s, a) = exp(φ(s, a)ᵀθ) / ∑b exp(φ(s, b)ᵀθ)
The gradient of the log probability is
∇θ log πθ(s, a) = φ(s, a)− Eπθ [φ(s, ·)]
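A quick numerical check of this identity for a linear softmax policy; the feature values and dimensions below are made up purely for illustration:

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi has shape [num_actions, num_features], with rows phi(s, a)."""
    prefs = phi @ theta
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_softmax(theta, phi, a):
    """grad_theta log pi(a|s) = phi(s, a) - E_pi[phi(s, .)]"""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 3))      # 4 actions, 3 features (illustrative)
theta = rng.normal(size=3)
a, eps = 2, 1e-6
numeric = np.array([
    (np.log(softmax_policy(theta + eps * np.eye(3)[i], phi)[a])
     - np.log(softmax_policy(theta - eps * np.eye(3)[i], phi)[a])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(numeric, grad_log_softmax(theta, phi, a), atol=1e-5)
```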
Gaussian Policy
In continuous action spaces, a Gaussian policy is common
E.g., the mean is a linear combination of state features: µ(s) = φ(s)ᵀθ
Let's assume the variance is fixed at σ²
(it can be parametrized as well, instead)
Policy is Gaussian, a ∼ N (µ(s), σ2)
The gradient of the log of the policy is then
∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²
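A corresponding sketch for sampling an action from a linear-Gaussian policy and computing its score; the feature values are illustrative:

```python
import numpy as np

def gaussian_sample_and_score(theta, phi_s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta,
    and return the action together with grad_theta log pi(a|s)."""
    mu = phi_s @ theta
    a = rng.normal(mu, sigma)
    score = (a - mu) * phi_s / sigma**2   # (a - mu(s)) phi(s) / sigma^2
    return a, score

rng = np.random.default_rng(0)
phi_s = np.array([1.0, 0.5, -2.0])        # illustrative state features
a, score = gaussian_sample_and_score(np.zeros(3), phi_s, sigma=1.0, rng=rng)
```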
Policy Gradient Theorem
The policy gradient theorem generalizes this approach to multi-step MDPs
Replaces instantaneous reward R with long-term value qπ(s, a)
Policy gradient theorem applies to start state objective,average reward and average value objective
Theorem
For any differentiable policy πθ(s, a), for any of the policy objective functions J = J1, JavR, or (1/(1−γ)) JavV, the policy gradient is
∇θJ(θ) = Eπθ [∇θ log πθ(A|S) qπθ(S, A)]
Monte-Carlo Policy Gradient (REINFORCE)
Using return Gt = Rt+1 + γRt+2 + γ²Rt+3 + . . . as an unbiased sample of qπθ(St, At)
Update parameters by stochastic gradient ascent
∆θt = α∇θ log πθ(At |St)Gt
function REINFORCE
  Initialise θ arbitrarily
  for each episode {S1, A1, R2, ..., ST−1, AT−1, RT} ∼ πθ do
    for t = 1 to T − 1 do
      θ ← θ + α ∇θ log πθ(At|St) Gt
    end for
  end for
  return θ
end function
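A compact Python sketch of the same algorithm with a linear softmax policy. The environment interface (reset() returning a state, step(a) returning (next state, reward, done)) and the feature map phi(s, a) are assumptions made for illustration, not part of the slides:

```python
import numpy as np

def reinforce(env, phi, num_features, num_actions,
              episodes=1000, alpha=0.01, gamma=0.99, seed=0):
    """Monte-Carlo policy gradient (REINFORCE) with a linear softmax policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(episodes):
        # Generate one episode following pi_theta
        s, done, trajectory = env.reset(), False, []
        while not done:
            a = rng.choice(num_actions, p=policy(s))
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # Walk backwards to accumulate returns G_t, updating theta at every step
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            pi = policy(s)
            grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(num_actions))
            theta += alpha * grad_log * G
    return theta
```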
Labyrinth
Actor-Critic Policy Gradient
Reducing Variance Using a Critic
Monte-Carlo policy gradient can have high variance, because we use the full return
We can use a critic to estimate the action-value function,
qw (s, a) ≈ qπθ(s, a)
Actor-critic algorithms maintain two sets of parameters
Critic: updates action-value function parameters w
Actor: updates policy parameters θ, in the direction suggested by the critic
One actor-critic algorithm: follow an approximate policy gradient
∇θJ(θ) ≈ Eπθ [∇θ log πθ(s, a) qw (s, a)]
∆θ = α∇θ log πθ(s, a) qw (s, a)
Estimating the Action-Value Function
The critic is solving a familiar problem: policy evaluation
What is the value of policy πθ for current parameters θ?
This problem was explored in previous lectures, e.g.
Monte-Carlo policy evaluation
Temporal-Difference learning
TD(λ)
Action-Value Actor-Critic
Simple actor-critic algorithm based on action-value critic
Using linear value function approximation qw(s, a) = φ(s, a)ᵀw
Critic: updates w by linear Sarsa(0)
Actor: updates θ by policy gradient
function QAC
  Initialise S, θ
  Sample A ∼ πθ(·|S)
  for each step do
    Sample reward R = R(S, A); sample transition S′ ∼ P(·|S, A)
    Sample action A′ ∼ πθ(·|S′)
    w ← w + β (R + γ qw(S′, A′) − qw(S, A)) φ(S, A)   [Sarsa(0)]
    θ ← θ + α ∇θ log πθ(A|S) qw(S, A)   [policy gradient update]
    A ← A′, S ← S′
  end for
end function
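A corresponding Python sketch of QAC with a softmax actor and a linear Sarsa(0) critic; as with the REINFORCE sketch above, the environment interface and feature map phi(s, a) are assumed for illustration:

```python
import numpy as np

def qac(env, phi, num_features, num_actions,
        steps=50_000, alpha=0.01, beta=0.1, gamma=0.99, seed=0):
    """Action-value actor-critic: softmax actor (theta), linear Sarsa(0) critic (w)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)   # actor parameters
    w = np.zeros(num_features)       # critic parameters

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    s = env.reset()
    a = rng.choice(num_actions, p=policy(s))
    for _ in range(steps):
        s2, r, done = env.step(a)
        if done:
            td_target = r                          # no bootstrap at terminal states
            s2 = env.reset()
            a2 = rng.choice(num_actions, p=policy(s2))
        else:
            a2 = rng.choice(num_actions, p=policy(s2))
            td_target = r + gamma * (phi(s2, a2) @ w)
        q_sa = phi(s, a) @ w
        w += beta * (td_target - q_sa) * phi(s, a)              # critic: linear Sarsa(0)
        pi = policy(s)
        grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(num_actions))
        theta += alpha * q_sa * grad_log                        # actor: policy gradient step
        s, a = s2, a2
    return theta, w
```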
Compatible Function Approximation
Bias in Actor-Critic Algorithms
Approximating the policy gradient introduces bias
A biased policy gradient may not find the right solution
e.g. if qw(s, a) uses aliased features, can we solve the gridworld example?
Luckily, if we choose value function approximation carefully:
we can avoid introducing any bias
i.e. we can still follow the exact policy gradient
Compatible Function Approximation
Theorem (Compatible Function Approximation Theorem)
If the following two conditions are satisfied:
1 Value function approximator is compatible to the policy
∇wqw (s, a) = ∇θ log πθ(s, a)
2 Value function parameters w minimise the mean-squared error
ε = Eπθ[(qπθ(S, A) − qw(S, A))²]
Then the policy gradient is exact,
∇θJ(θ) = Eπθ[∇θ log πθ(A|S) qw(S, A)]
Proof of Compatible Function Approximation Theorem
If w is chosen to minimise mean-squared error, the gradient of ε w.r.t. w must be zero,
∇wε = 0
Eπθ[(qπθ(S, A) − qw(S, A)) ∇w qw(S, A)] = 0
Eπθ[(qπθ(S, A) − qw(S, A)) ∇θ log πθ(A|S)] = 0
Eπθ[qπθ(S, A) ∇θ log πθ(A|S)] = Eπθ[qw(S, A) ∇θ log πθ(A|S)]
So qw (s, a) can be substituted directly into the policy gradient,
∇θJ(θ) = Eπθ [∇θ log πθ(s, a)qw (s, a)]
Compatible Function Approximation
1 Value function approximator is compatible to the policy
∇wqw (s, a) = ∇θ log πθ(s, a)
2 Value function parameters w minimise the mean-squared error
ε = Eπθ[(qπθ(S, A) − qw(S, A))²]
Unfortunately, 1) only holds for certain function classes
Unfortunately, 2) is never quite true
These are typical limitations of theory; be aware of the assumptions
Advantage Function Critic
Reducing Variance Using a Baseline
We subtract a baseline function b(s) from the policy gradient
This can reduce variance, without changing expectation
Eπθ [∇θ log πθ(s, a) b(s)] = ∑s∈S dπθ(s) ∑a ∇θπθ(s, a) b(s)
= ∑s∈S dπθ(s) b(s) ∇θ ∑a∈A πθ(s, a)
= 0
So we can rewrite the policy gradient as
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) (qπθ(s, a)− b(s))]
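One way to see the effect is to compare the empirical mean and variance of the gradient estimator with and without a baseline on a toy problem; the bandit rewards (with a large common offset) and the choice b = expected reward under π are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([10.0, 11.0, 12.0])   # rewards with a large common offset (illustrative)

theta = np.zeros(3)
pi = np.exp(theta - theta.max()); pi /= pi.sum()
baseline = pi @ true_means                  # one possible baseline b(s): E_pi[R]

def grad_sample(use_baseline):
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 1.0)
    score = -pi.copy()
    score[a] += 1.0                          # grad_theta log pi(a)
    return score * ((r - baseline) if use_baseline else r)

for use_baseline in (False, True):
    g = np.array([grad_sample(use_baseline) for _ in range(10_000)])
    print("baseline" if use_baseline else "no baseline",
          "mean:", g.mean(axis=0).round(2), "variance:", g.var(axis=0).round(2))
```

Both estimators have (approximately) the same mean, the true gradient, but the baselined one has a much smaller variance.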
Estimating the Advantage Function (1)
The advantage function can significantly reduce the variance of the policy gradient
So the critic should really estimate the advantage function
For example, by estimating both vπθ(s) and qπθ(s, a)
Using two function approximators and two parameter vectors,
vξ(s) ≈ vπθ(s)
qw (s, a) ≈ qπθ(s, a)
A(s, a) = qw (s, a)− vξ(s)
And updating both value functions by e.g. TD learning
Estimating the Advantage Function (2)
For the true value function vπθ(s), the TD error δπθ
δπθ = r + γvπθ(s ′)− vπθ(s)
is an unbiased estimate of the advantage function
Eπθ [δπθ |s, a] = Eπθ [Rt+1 + γvπθ(St+1)|s, a]− vπθ(s)
= qπθ(s, a)− vπθ(s)
= Aπθ(s, a)
So we can use the TD error to compute the policy gradient
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) δπθ ]
In practice we can use an approximate TD error
δv = r + γvξ(s′)− vξ(s)
This approach only requires one set of critic parameters ξ
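A minimal sketch of one such actor-critic update, using the TD error as the advantage estimate; the argument names, feature maps and shapes are illustrative:

```python
import numpy as np

def td_actor_critic_step(theta, xi, phi_s, phi_s_next, phi_sa, pi_s, a, r,
                         alpha=0.01, beta=0.1, gamma=0.99, done=False):
    """One update of a softmax actor (theta) and a linear state-value critic (xi).

    phi_s, phi_s_next: state features; phi_sa: per-action features [num_actions, d];
    pi_s: current action probabilities in s; a, r: sampled action and reward.
    """
    v_s = phi_s @ xi
    v_s_next = 0.0 if done else phi_s_next @ xi
    delta = r + gamma * v_s_next - v_s            # TD error ~ advantage A(s, a)
    xi = xi + beta * delta * phi_s                # critic: TD(0) update
    grad_log = phi_sa[a] - pi_s @ phi_sa          # softmax score grad log pi(a|s)
    theta = theta + alpha * delta * grad_log      # actor: policy gradient with delta
    return theta, xi, delta
```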
Eligibility Traces
Critics at Different Time-Scales
The critic can estimate the value function vξ(s) from many targets at different time-scales (from the last lecture...)
For MC, the target is the return Gt
∆ξ = α(Gt − vξ(St))φ(St)
For TD(0), the target is the TD target Rt+1 + γvξ(St+1)
∆ξ = α(Rt+1 + γvξ(St+1) − vξ(St))φ(St)
For forward-view TD(λ), the target is the λ-return Gλt
∆ξ = α(Gλt − vξ(St))φ(St)
For backward-view TD(λ), we use eligibility traces
δt = Rt+1 + γvξ(St+1) − vξ(St)
et = γλ et−1 + φ(St)
∆ξ = α δt et
Actors at Different Time-Scales
The policy gradient can also be estimated at many time-scales
∇θJ(θ) = Eπθ [∇θ log πθ(s, a) Aπθ(s, a)]
Monte-Carlo policy gradient uses error from complete return
∆θ = α(Gt − vξ(st))∇θ log πθ(st , at)
Actor-critic policy gradient uses the one-step TD error
∆θ = α(Rt+1 + γvξ(st+1)− vξ(st))∇θ log πθ(st , at)
Policy Gradient with Eligibility Traces
Just like forward-view TD(λ), we can mix over time-scales
∆θ = α(Gλt − vξ(st)) ∇θ log πθ(st, at)
Like backward-view TD(λ), we can also use eligibility traces
By equivalence with TD(λ), substituting φ(s) = ∇θ log πθ(s, a)
δt = Rt+1 + γvξ(St+1) − vξ(St)
et = γλ et−1 + ∇θ log πθ(At|St)
∆θ = α δt et
This update can be applied online, to incomplete sequences
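A sketch of one such online update, keeping one eligibility trace for the critic (accumulating state features) and one for the actor (accumulating score vectors); names and shapes are again illustrative:

```python
import numpy as np

def actor_critic_traces_step(theta, xi, e_theta, e_xi, phi_s, phi_s_next, score, r,
                             alpha=0.01, beta=0.1, gamma=0.99, lam=0.9, done=False):
    """One online actor-critic update with eligibility traces.

    score = grad_theta log pi(a|s); e_theta, e_xi are the actor and critic traces.
    """
    v_next = 0.0 if done else phi_s_next @ xi
    delta = r + gamma * v_next - phi_s @ xi       # TD error
    e_xi = gamma * lam * e_xi + phi_s             # critic trace
    xi = xi + beta * delta * e_xi
    e_theta = gamma * lam * e_theta + score       # actor trace
    theta = theta + alpha * delta * e_theta
    return theta, xi, e_theta, e_xi
```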
Natural Policy Gradient
Alternative Policy Gradient Directions
Gradient ascent algorithms can follow any ascent direction
A good ascent direction can significantly speed convergence
Also, a policy can often be reparametrised without changing action probabilities
For example, increasing score of all actions in a softmax policy
The vanilla gradient is sensitive to these reparametrisations
Natural Policy Gradient
The natural policy gradient is parametrisation independent
It finds the direction that maximally ascends the objective function, when changing the policy by a small, fixed amount
∇natθ πθ(s, a) = Gθ⁻¹ ∇θπθ(s, a)
where Gθ is the Fisher information matrix
Gθ = Eπθ[∇θ log πθ(s, a) ∇θ log πθ(s, a)ᵀ]
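A sketch of how this can be used in practice: estimate the Fisher matrix from sampled score vectors under the current policy, then solve a linear system to turn the vanilla gradient into the natural gradient. The small ridge term is an added assumption to keep the empirical Gθ invertible:

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, reg=1e-3):
    """Return an estimate of G^{-1} grad J.

    score_samples: array [N, d] of grad_theta log pi(A|S) samples drawn under pi_theta.
    vanilla_grad:  an estimate of the ordinary policy gradient (length d).
    """
    G = score_samples.T @ score_samples / len(score_samples)   # empirical Fisher matrix
    return np.linalg.solve(G + reg * np.eye(G.shape[0]), vanilla_grad)
```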
Natural Actor-Critic
Using compatible linear function approximation,
φ(s, a) = ∇θ log πθ(s, a)
So the natural policy gradient simplifies,
∇θJ(θ) = Eπθ[∇θ log πθ(s, a) Aπθ(s, a)]
= Eπθ[∇θ log πθ(s, a) φ(s, a)ᵀw]
= Eπθ[∇θ log πθ(s, a) ∇θ log πθ(s, a)ᵀ] w
= Gθ w
∇natθ J(θ) = w
i.e. update actor parameters in direction of critic parameters
Natural Actor Critic in Snake Domain
The dynamics of the CPG are given by
φ(t + 1) = φ(t) + ω(t)
where ω is the input (corresponding to an angular velocity) to the CPG, either a constant input (ω0) or an impulsive input (ω0 + Δω), where Δω is fixed at π/15. The snake-like robot controlled by this CPG goes straight while ω is ω0, but turns when an impulsive input is added at a certain instance; the turning angle depends on both the CPG's internal state φ and the intensity of the impulse. An action is to select an input out of these two candidates, and the controller represents the action selection probability, which is adjusted by RL.
To carry out RL with the CPG, we employ the CPG-actor-critic model [24], in which the physical system (robot) and the CPG are regarded as a single dynamical system, called the CPG-coupled system, and the control signal for the CPG-coupled system is optimized by RL.
[Figure 2: Crank course and sensor setting — (a) crank course, (b) sensor setting]
4.1 Experimental settings
We here describe the settings for the experiments, which examine the efficiency of sample reuse and the exploration ability of our new RL method, off-NAC.
It is assumed the sensor detects the distance d between the sensor point (red circle in Fig. 2(b)) and the nearest wall, and the angle θ between the proceeding direction and the nearest wall, as depicted in Fig. 2(b). The sensor point is the projection of the head link's position onto the body axis of the snake-like robot, which passes through the center of mass and points in the proceeding direction. The angular sensor perceives the wall angle within [0, 2π) [rad], measured anti-clockwise, and θ = 0 indicates that the snake-like robot hits the wall.
Natural Actor Critic in Snake Domain (2)
With the policy parameter before learning, θbi, the robot soon hit the wall and got stuck (Fig. 3(a)). In contrast, the policy parameter after learning, θai, allowed the robot to pass through the crank course repeatedly (Fig. 3(b)). This result shows a good controller was acquired by our off-NAC.
[Figure 3: Behaviors of snake-like robot — (a) before learning, (b) after learning; trajectories shown as position x (m) versus position y (m).]
Figure 4 shows the action a (accompanied by the number of aggregated impulses), the internal state of the CPG, cos(φ), the relative angle of the wall θ̄ (for visibility, the range θ ∈ [π, 2π) was moved to θ̄ ∈ [−π, 0); if θ̄ is negative (positive), the wall is on the right-hand side (left-hand side) of the robot), and the distance between the robot and the wall, d. When the wall was on the right-hand side, the impulsive input was applied (a = 1) when cos(φ) was large; this control was appropriate for making the robot turn left. In addition, the impulsive input was added many times, especially when the distance d became small.
We compare the performance of our off-NAC with that of the on-NAC [1] [2]. We prepared ten initial policy parameters randomly drawn from [−10, 0]⁴, and they were adjusted by the two RL methods. Learning parameters, such as the learning coefficient, were best tuned by hand. To examine solely the efficiency of the reuse of past trajectories here, we did not use any exploratory behavior policies (exploration parameter set to 0), i.e., trajectories were generated according to the current policy in each episode. The maximum number of stored sample sequences (K) was set at 10, and the number of initially-stored sample sequences (K0) was set at 0. Eligibility λ was set at 0.999. We regarded a learning run as successful when the actor was improved so as to allow the robot to pass through the crank course twice without any collision with the wall, after 500 learning episodes.
Table 1 shows the number of successful learning runs and the average number of episodes until the robot was first successful in passing through the crank course. With respect to both of these criteria, our off-NAC clearly outperformed the on-NAC.
Figure 5 shows a typical learning run by the off-NAC (Fig. 5(1a-c)) and the best run by the on-NAC (Fig. 5(2a-c)). Panels (a), (b) and (c) show the average reward in a single
Summary of Policy Gradient Algorithms
The policy gradient has many equivalent forms
∇θJ(θ) = Eπθ[∇θ log πθ(A|S) Gt ] REINFORCE
= Eπθ[∇θ log πθ(A|S) qw (S ,A)] Q Actor-Critic
= Eπθ[∇θ log πθ(A|S) Aw (S ,A)] Advantage Actor-Critic
= Eπθ[∇θ log πθ(A|S) δt ] TD Actor-Critic
= Eπθ[δtet ] TD(λ) Actor-Critic
Gθ⁻¹ ∇θJ(θ) = w (linear) Natural Actor-Critic
Each leads to a stochastic gradient ascent algorithm
Critic uses policy evaluation (e.g. MC or TD learning) to estimate qπ(s, a), Aπ(s, a) or vπ(s)