
Policy Gradient Methods

http://bicmr.pku.edu.cn/~wenzw/bigdata2020.html

Acknowledgement: these slides are based on Prof. Shipra Agrawal's lecture notes

Outline

1 Policy gradient methods
   - Finite horizon MDP
   - Infinite horizon case
2 Actor-critic methods
3 TRPO and PPO
4 AlphaGo Zero

Policy gradient methods

In Q-learning, function approximation was used to approximate the Q-function, and the policy was the greedy policy based on the estimated Q-function. In policy gradient methods, we approximate a stochastic policy directly using a parametric function approximator.

More formally, given an MDP $(S, A, s_1, R, P)$, let $\pi_\theta : S \to \Delta_A$ denote a randomized policy parameterized by a parameter vector $\theta \in \mathbb{R}^d$. For a scalable formulation, we want $d \ll |S|$.

For example, the policy $\pi_\theta$ might be represented by a neural network whose input is a representation of the state, whose output is action-selection probabilities, and whose weights form the policy parameters $\theta$. (The architecture for such a [deep] neural network is similar to that of a multi-label classifier, with the input being a state and the labels being the different actions. The network should be trained to predict the probability of each action given an input state.)

Policy gradient methods

For simplicity, assume that $\pi_\theta$ is differentiable with respect to $\theta$, i.e., $\frac{\partial \pi_\theta(s,a)}{\partial \theta}$ exists. This is true, for example, if a neural network with differentiable activation functions is used to define $\pi_\theta$. Let $\rho(\pi_\theta)$ denote the gain of policy $\pi_\theta$. This may be defined as the long-term average reward, the long-term discounted reward, or the total reward in an episode or finite horizon. Therefore, solving for the optimal policy reduces to solving
$$\max_\theta\ \rho(\pi_\theta)$$
In order to use a stochastic gradient descent algorithm for finding a stationary point of the above problem, we need to compute an (unbiased) estimate of the gradient of $\rho(\pi_\theta)$ with respect to $\theta$.

Finite horizon MDP

Here the performance measure to optimize is the total expected reward over a finite horizon $H$:
$$\rho(\pi) = E\left[\sum_{t=1}^{H} \gamma^{t-1} r_t \,\middle|\, \pi, s_1\right]$$
Let $\pi(s, a)$ denote the probability of action $a$ in state $s$ for randomized policy $\pi$. Let $D_\pi(\tau)$ denote the probability distribution of a trajectory (state-action sequence) $\tau = (s_1, a_1, s_2, \ldots, a_{H-1}, s_H)$ of states on starting from state $s_1$ and following policy $\pi$. That is,
$$D_\pi(\tau) := \prod_{i=1}^{H-1} \pi(s_i, a_i)\, P(s_i, a_i, s_{i+1})$$

Finite horizon MDP

Theorem 2
For a finite horizon MDP $(S, A, s_1, P, R, H)$, let $R(\tau)$ be the total reward for a sample trajectory $\tau$, on following $\pi_\theta$ for $H$ steps, starting from state $s_1$. Then,
$$\nabla_\theta \rho(\pi_\theta) = E_\tau\left[R(\tau)\, \nabla_\theta \log\left(D_{\pi_\theta}(\tau)\right)\right] = E_\tau\left[R(\tau) \sum_{t=1}^{H-1} \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right)\right]$$

Proof. Let $R(\tau)$ be the total reward for an entire sample trajectory $\tau$, on following $\pi_\theta$ for $H$ steps, starting from state $s_1$. That is, given a sample trajectory $\tau = (s_1, a_1, s_2, \ldots, a_{H-1}, s_H)$ from distribution $D_{\pi_\theta}$,
$$R(\tau) := \sum_{t=1}^{H-1} \gamma^{t-1} R(s_t, a_t)$$

Proof of Theorem 2

Then,
$$\rho(\pi_\theta) = E_{\tau \sim D_{\pi_\theta}}[R(\tau)]$$
Now (the calculations below implicitly assume finite state and action spaces, so that the distribution $D_{\pi_\theta}(\tau)$ has finite support),
$$\begin{aligned}
\frac{\partial \rho(\pi_\theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\, E_{\tau \sim D_{\pi_\theta}}[R(\tau)] \\
&= \frac{\partial}{\partial \theta} \sum_{\tau:\, D_{\pi_\theta}(\tau) > 0} D_{\pi_\theta}(\tau)\, R(\tau) \\
&= \sum_{\tau:\, D_{\pi_\theta}(\tau) > 0} D_{\pi_\theta}(\tau)\, \frac{\partial}{\partial \theta} \log\left(D_{\pi_\theta}(\tau)\right) R(\tau) \\
&= E_{\tau \sim D_{\pi_\theta}}\left[\frac{\partial}{\partial \theta} \log\left(D_{\pi_\theta}(\tau)\right) R(\tau)\right]
\end{aligned}$$

Proof of Theorem 2

Further, for a given sample trajectory $\tau^i$,
$$\begin{aligned}
\nabla_\theta \log\left(D_{\pi_\theta}(\tau^i)\right) &= \sum_{t=1}^{H-1} \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right) + \nabla_\theta \log P(s^i_t, a^i_t, s^i_{t+1}) \\
&= \sum_{t=1}^{H-1} \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right)
\end{aligned}$$
since the transition probabilities $P$ do not depend on $\theta$.

Finite horizon MDP

The gradient representation given by the above theorem is extremely useful: given a sample trajectory, it can be computed using only the policy parameter, and does not require knowledge of the transition model $P(\cdot,\cdot,\cdot)$! It does seem to require knowledge of the reward model, but that can be handled by replacing $R(\tau^i)$ by $\hat{R}(\tau^i) = r_1 + \gamma r_2 + \cdots + \gamma^{H-2} r_{H-1}$, the total of the sample rewards observed in this trajectory. Since, given a trajectory $\tau$, the quantity $D_{\pi_\theta}(\tau)$ is determined, and $E[\hat{R}(\tau) \mid \tau] = R(\tau)$,
$$\nabla_\theta \rho(\pi_\theta) = E_\tau\left[R(\tau)\, \nabla_\theta \log\left(D_{\pi_\theta}(\tau)\right)\right] = E_\tau\left[\hat{R}(\tau)\, \nabla_\theta \log\left(D_{\pi_\theta}(\tau)\right)\right] = E_\tau\left[\hat{R}(\tau) \sum_{t=1}^{H-1} \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right)\right]$$

Unbiased estimator of gradient from samples

From the above, given sample trajectories $\tau^i$, $i = 1, \ldots, m$, an unbiased estimator for the gradient $\nabla_\theta \rho(\pi_\theta)$ is given by:
$$g = \frac{1}{m} \sum_{i=1}^{m} R(\tau^i)\, \nabla_\theta \log\left(D_{\pi_\theta}(\tau^i)\right) = \frac{1}{m} \sum_{i=1}^{m} R(\tau^i) \sum_{t=1}^{H-1} \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right)$$

Baseline

Note that for any constant $b$ (or any $b$ that is conditionally independent of the sampling from $\pi_\theta$ given $\theta$), we have:
$$E_\tau\left[b\, \frac{\partial}{\partial \theta_j} \log\left(D_{\pi_\theta}(\tau)\right) \,\middle|\, \theta, s_1\right] = b \int_\tau \frac{\partial}{\partial \theta_j}\, D_{\pi_\theta}(\tau) = b\, \frac{\partial}{\partial \theta_j} \int_\tau D_{\pi_\theta}(\tau) = 0$$
Therefore, choosing any 'baseline' $b$, the following is also an unbiased estimator of $\nabla_\theta \rho(\pi_\theta)$:
$$g = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{H-1} \left(R(\tau^i) - b\right) \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right)$$

Baseline

Or, more generally, one could even use a state- and time-dependent baseline $b_t(s^i_t)$ that is conditionally independent of the sampling from $\pi_\theta$ given $s^i_t, \theta$, to get the estimator:
$$g = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{H-1} \left(R(\tau^i) - b_t(s^i_t)\right) \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right) \qquad (1)$$
Below we show this is unbiased. The expectations below are over trajectories $(s_1, a_1, \ldots, a_{H-1}, s_H)$, where $a_t \sim \pi(s_t, \cdot)$ given $s_t$. For any fixed $\theta, t$, the baseline $b_t(s_t) \mid s_t$ needs to be deterministic or independent of $a_t \mid s_t$. For simplicity we assume it is deterministic.

Baseline

$$\begin{aligned}
E&\left[\sum_{t=1}^{H-1} b_t(s_t)\, \frac{\partial}{\partial \theta_j} \log\left(\pi_\theta(s_t, a_t)\right) \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} E\left[b_t(s_t)\, \frac{\partial}{\partial \theta_j} \log\left(\pi_\theta(s_t, a_t)\right) \,\middle|\, s_t\right] \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} b_t(s_t)\, E\left[\frac{\partial}{\partial \theta_j} \log\left(\pi_\theta(s_t, a_t)\right) \,\middle|\, s_t\right] \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} b_t(s_t) \sum_a \pi_\theta(s_t, a)\, \frac{\partial}{\partial \theta_j} \log\left(\pi_\theta(s_t, a)\right) \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} b_t(s_t) \sum_a \frac{\partial}{\partial \theta_j}\, \pi_\theta(s_t, a) \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} b_t(s_t)\, \frac{\partial}{\partial \theta_j} \sum_a \pi_\theta(s_t, a) \,\middle|\, \theta, s_1\right] \\
&= E\left[\sum_{t=1}^{H-1} b_t(s_t)\, \frac{\partial}{\partial \theta_j}\, 1 \,\middle|\, \theta, s_1\right] = 0
\end{aligned}$$

An example of such a state-dependent baseline $b_t(s)$, given $s$ and $\theta$, is $V^{\pi_\theta}_{H-t}(s)$, i.e., the value of policy $\pi_\theta$ starting from state $s$ at time $t$. We will see later that such a baseline is useful in reducing the variance of gradient estimates.

Vanilla policy gradient algorithm

Initialize policy parameter $\theta$, and a baseline.
In each iteration,
  - Execute current policy $\pi_\theta$ to obtain several sample trajectories $\tau^i$, $i = 1, \ldots, m$.
  - Use these sample trajectories and the chosen baseline to compute the gradient estimator $g$ as in (1).
  - Update $\theta \leftarrow \theta + \alpha g$.
  - Update the baseline as required.
The above is essentially the same as the REINFORCE algorithm introduced by [Williams, 1988, 1992].
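A compact sketch of this loop, reusing `numpy` and `grad_log_softmax_pi` from the sketch above, with a constant baseline equal to the mean sampled return (one simple choice). The environment interface `env.reset()` / `env.step(a)` returning `(next_state, reward, done)` is an assumed convention, not something specified in the slides.

```python
def sample_trajectory(env, theta, phi, horizon):
    """Roll out pi_theta for at most `horizon` steps; returns a list of (s, a, r)."""
    traj, s = [], env.reset()
    for _ in range(horizon):
        logits = phi[s] @ theta
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        if done:
            break
        s = s_next
    return traj

def vanilla_policy_gradient(env, theta, phi, iters=100, m=16, alpha=0.01, horizon=100):
    for _ in range(iters):
        trajs = [sample_trajectory(env, theta, phi, horizon) for _ in range(m)]
        returns = [sum(r for _, _, r in tr) for tr in trajs]
        b = np.mean(returns)                       # constant baseline: average return
        g = np.zeros_like(theta)
        for tr, R in zip(trajs, returns):
            g += (R - b) * sum(grad_log_softmax_pi(theta, phi, s, a) for s, a, _ in tr)
        theta = theta + alpha * g / m              # gradient ascent step
    return theta
```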

Long term average reward

$\rho(\pi)$: long-term average reward of (randomized) policy $\pi$
$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, E\left[r_1 + r_2 + \cdots + r_T \mid \pi\right] = \sum_s d^\pi(s) \sum_a \pi(s, a)\, R(s, a)$$
where $d^\pi(s) = \lim_{t \to \infty} \Pr(s_t = s \mid s_1, \pi)$, the stationary distribution for policy $\pi$, is assumed to exist and to be independent of the starting state $s_1$ for all policies $\pi$. The value of a state given a policy $\pi$ is defined as:
$$V^\pi(s) = \lim_{T \to \infty} E\left[\sum_{t=1}^{T} (r_t - \rho(\pi)) \,\middle|\, s_1 = s,\ a_t = \pi(s_t),\ t = 1, \ldots, T\right] = \sum_a \pi(s, a)\, Q^\pi(s, a)$$
where
$$Q^\pi(s, a) := \lim_{T \to \infty} E\left[\sum_{t=1}^{T} (r_t - \rho(\pi)) \,\middle|\, s_1 = s,\ a_1 = a,\ a_t = \pi(s_t),\ t = 2, \ldots, T\right]$$

Discounted rewards

$$\rho(\pi, s_1) = \lim_{T \to \infty} E\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, \pi, s_1\right] = \sum_s d^\pi(s) \sum_a \pi(s, a)\, R(s, a)$$
where $d^\pi(s) = \lim_{T \to \infty} \sum_{t=1}^{T} \gamma^{t-1} \Pr(s_t = s \mid s_1, \pi)$, and the value of a state given a policy $\pi$ is defined as:
$$V^\pi(s) = \lim_{T \to \infty} E\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, s_1 = s,\ a_t = \pi(s_t),\ t = 1, \ldots, T\right] = \sum_a \pi(s, a)\, Q^\pi(s, a)$$
where
$$Q^\pi(s, a) := \lim_{T \to \infty} E\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, s_1 = s,\ a_1 = a,\ a_t = \pi(s_t),\ t = 2, \ldots, T\right]$$

Policy gradient theorem

Theorem 3 (Policy gradient theorem [Sutton et al., 1999])
For an infinite horizon MDP (average or discounted reward),
$$\nabla_\theta \rho(\pi_\theta, s_1) = \sum_s d^{\pi_\theta}(s) \sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(s, a) = \sum_s d^{\pi_\theta}(s)\, E_{a \sim \pi(s)}\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log\left(\pi_\theta(s, a)\right)\right]$$
That is, the gradient of the gain with respect to $\theta$ can be expressed in terms of the gradient of the policy function with respect to $\theta$.

Proof of Policy gradient theorem

Proof. Average reward formulation.
$$Q^\pi(s, a) = R(s, a) + \sum_{s'} P(s, a, s')\, V^\pi(s') - \rho(\pi)$$
$$V^\pi(s) = \sum_a \pi(s, a)\, Q^\pi(s, a)$$
$$\begin{aligned}
\nabla_\theta V^\pi(s) &= \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \pi(s, a)\, \nabla_\theta Q^\pi(s, a) \\
&= \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_a \pi(s, a)\left[\sum_{s'} P(s, a, s')\, \nabla_\theta V^\pi(s') - \nabla_\theta \rho(\pi)\right]
\end{aligned}$$
$$\nabla_\theta \rho(\pi) = \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_a \pi(s, a)\left[\sum_{s'} P(s, a, s')\, \nabla_\theta V^\pi(s')\right] - \nabla_\theta V^\pi(s)$$

Proof of Policy gradient theorem

Summing over states weighted by $d^\pi$:
$$\begin{aligned}
\sum_s d^\pi(s)\, \nabla_\theta \rho(\pi) &= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_s d^\pi(s) \sum_a \pi(s, a)\left[\sum_{s'} P(s, a, s')\, \nabla_\theta V^\pi(s')\right] - \sum_s d^\pi(s)\, \nabla_\theta V^\pi(s) \\
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_{s'} d^\pi(s')\, \nabla_\theta V^\pi(s') - \sum_s d^\pi(s)\, \nabla_\theta V^\pi(s) \quad \text{(since $d^\pi$ is the stationary distribution of $\pi$)} \\
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a)
\end{aligned}$$
Since $\sum_s d^\pi(s) = 1$, the left-hand side equals $\nabla_\theta \rho(\pi)$, which proves the claim in the average reward case.

Proof of Policy gradient theorem

Discounted reward formulation. By the Bellman equations:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s')\, V^\pi(s')$$
$$V^\pi(s) = \sum_a \pi(s, a)\, Q^\pi(s, a)$$
$$\nabla_\theta V^\pi(s) = \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \pi(s, a)\, \nabla_\theta Q^\pi(s, a)$$
Let $\Pr(s \to x, k, \pi)$ be the probability of going from state $s$ to state $x$ in $k$ steps under policy $\pi$. Define
$$\sum_s d^\pi(s)\, \nabla_\theta V^\pi(s) := \sum_s \left(\sum_{k=0}^{\infty} \gamma^k \Pr(s_1 \to s, k, \pi)\right) \nabla_\theta V^\pi(s)$$

Proof of Policy gradient theorem

$$\begin{aligned}
&= \sum_s \left(\sum_{k=0}^{\infty} \gamma^k \Pr(s_1 \to s, k, \pi)\right) \cdot \left(\sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \gamma \sum_a \pi(s, a) \sum_{s'} P(s, a, s')\, \nabla_\theta V^\pi(s')\right) \\
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_{s'} \left(\sum_s \sum_{k=0}^{\infty} \gamma^k \Pr(s_1 \to s, k, \pi)\right) \gamma \sum_a \pi(s, a)\, P(s, a, s')\, \nabla_\theta V^\pi(s') \\
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_{s'} \left(\sum_{k=0}^{\infty} \gamma^{k+1} \Pr(s_1 \to s', k+1, \pi)\right) \nabla_\theta V^\pi(s')
\end{aligned}$$

Proof of Policy gradient theorem

$$\begin{aligned}
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_{s'} \left(d^\pi(s') - \Pr(s_1 \to s', 0, \pi)\right) \nabla_\theta V^\pi(s') \\
&= \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a) + \sum_{s'} d^\pi(s')\, \nabla_\theta V^\pi(s') - \nabla_\theta V^\pi(s_1)
\end{aligned}$$
Moving the terms around,
$$\nabla_\theta V^\pi(s_1) = \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a)$$
That is,
$$\nabla_\theta \rho(\pi, s_1) = \sum_s d^\pi(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(s, a)$$

Remarks on Policy gradient theorem

The key aspect of the expression for the gradient is that there are no terms of the form $\frac{\partial d^{\pi_\theta}(s)}{\partial \theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if $s$ was sampled from the distribution obtained by following $\pi_\theta$, then $\sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(s, a)$ would be an unbiased estimate of $\nabla_\theta \rho(\pi_\theta)$. Of course, $Q^{\pi_\theta}(s, a)$ is also not normally known and must be estimated (in an unbiased way).

Estimation using samples

Suppose we run policy $\pi$ several times starting from $s_1$ to observe sample trajectories $\{\tau^i\}$. Then, at each time step $t$ in a trajectory $\tau$, set
$$\hat{Q}_t := \sum_{t'=t}^{T-1} \gamma^{t'-t}\, r_{t'}$$
where $r_{t'}$ is the observed reward at time $t'$. Then, $\hat{Q}_t$ is an unbiased estimate of $Q(s_t, a_t)$ at time $t$ (almost, assuming large enough $T$), i.e.,
$$E\left[\hat{Q}_t \mid s_t, a_t\right] = Q(s_t, a_t)$$
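A small helper for these reward-to-go estimates, computed by a backwards recursion over one trajectory's observed rewards (a sketch; names are illustrative).

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * r_{t'}, via a backwards recursion."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

# Example: rewards_to_go([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
```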

Estimation using samples

Let
$$F_t := \hat{Q}_t\, \nabla_\theta \log \pi_\theta(s_t, a_t)$$
Then,
$$\begin{aligned}
E[F_t \mid s_t] &= E\left[E\left[\hat{Q}_t \mid s_t, a_t\right] \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right) \,\middle|\, s_t\right] \\
&= E\left[Q(s_t, a_t)\, \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right) \mid s_t\right] \\
&= \sum_a Q(s_t, a)\, \nabla_\theta \pi_\theta(s_t, a)
\end{aligned}$$

Estimation using samples

And,
$$\begin{aligned}
E\left[\sum_{t=1}^{\infty} \gamma^{t-1} F_t\right] &= \sum_{t=1}^{\infty} \gamma^{t-1} \sum_s E[F_t \mid s_t = s]\, \Pr(s_t = s \mid s_1) \\
&= \sum_{t=1}^{\infty} \gamma^{t-1} \sum_s \left(\sum_a Q(s, a)\, \nabla_\theta \pi_\theta(s, a)\right) \Pr(s_t = s \mid s_1) \\
&= \sum_s \left(\sum_a Q(s, a)\, \nabla_\theta \pi_\theta(s, a)\right) \sum_{t=1}^{\infty} \gamma^{t-1} \Pr(s_t = s \mid s_1) \\
&= \sum_s \left(\sum_a Q(s, a)\, \nabla_\theta \pi_\theta(s, a)\right) d^\pi(s) \\
&= \nabla_\theta \rho(\pi_\theta)
\end{aligned}$$
where the last step follows from the policy gradient theorem.

Estimation using samples

Therefore, the following is an unbiased (almost, for large $T$) estimate of the gradient $\nabla_\theta \rho(\pi, s_1)$ of policy $\pi$ at $\theta$, starting in state $s_1$:
$$g = \sum_{t=1}^{T} \gamma^{t-1} F_t = \sum_{t=1}^{T} \gamma^{t-1} \hat{Q}_t\, \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right)$$
From $N$ sample trajectories $\{\tau^i,\ i = 1, \ldots, N\}$ we can develop the sample average estimate
$$g = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} F^i_t = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} \hat{Q}^i_t\, \nabla_\theta \log\left(\pi_\theta(s^i_t, a^i_t)\right)$$

Baseline

We can obtain another unbiased estimate by replacing $F_t$ with
$$F'_t = \left(\hat{Q}_t - b_t(s_t)\right) \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right)$$
for an arbitrary baseline function $b_t(s)$. Then,
$$E\left[\sum_t \gamma^{t-1} F'_t\right] = E\left[\sum_t \gamma^{t-1} F_t\right] - E\left[\sum_t \gamma^{t-1} b_t(s_t)\, \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right)\right] = E\left[\sum_t \gamma^{t-1} F_t\right]$$

Baseline

The last step follows because:
$$E\left[b_t(s_t)\, \nabla_\theta \log\left(\pi_\theta(s_t, a_t)\right) \mid s_t\right] = b_t(s_t) \sum_a \nabla_\theta \pi_\theta(s_t, a) = b_t(s_t)\, \nabla_\theta(1) = 0$$
That is, the following is also an unbiased gradient estimate:
$$g = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} \left(\hat{Q}^i_t - b_t(s^i_t)\right) \nabla_\theta \log\left(\pi(s^i_t, a^i_t)\right) \qquad (2)$$
The difference $\hat{Q}_t - b_t(s_t)$ is also referred to as the Advantage. This terminology appears in more recent algorithms like 'Asynchronous Advantage Actor Critic (A3C)' [Mnih et al., 2016] and 'Generalized Advantage Estimation (GAE)' [Schulman et al., 2015].
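As one concrete choice of advantage estimator, here is a sketch in the spirit of Generalized Advantage Estimation [Schulman et al., 2015], which combines TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; the value array and interface are assumptions for illustration, not part of the slides.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has length len(rewards) + 1 (it includes a bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```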

Vanilla Policy Gradient Algorithm

Initialize policy parameter $\theta$, and a baseline function $b_t(s)$, $\forall s$.
In each iteration,
  - Execute current policy $\pi_\theta$ to obtain several sample trajectories $\tau^i$, $i = 1, \ldots, m$.
  - For any given sample trajectory $i$, use the observed rewards $r_1, r_2, \ldots$ to compute $\hat{Q}^i_t := \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$.
  - Use $\hat{Q}^i_t$ and the baseline function $b_t(s)$ to compute the gradient estimator $g$ as in (2).
  - Update $\theta \leftarrow \theta + \alpha g$.
  - Re-optimize the baseline.

Softmax policies

Consider a policy set parameterized by $\theta \in \mathbb{R}^d$ such that, given $s \in S$, the probability of picking action $a \in A$ is given by:
$$\pi_\theta(s, a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_{a' \in A} e^{\theta^\top \phi_{sa'}}}$$
where each $\phi_{sa}$ is a $d$-dimensional feature vector characterizing the state-action pair $(s, a)$. This is a popular form of policy space called softmax policies. Here,
$$\nabla_\theta \log\left(\pi_\theta(s, a)\right) = \phi_{sa} - \sum_{a' \in A} \phi_{sa'}\, \pi_\theta(s, a') = \phi_{sa} - E_{a' \sim \pi(s)}\left[\phi_{sa'}\right]$$
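A quick numerical sanity check of this score-function identity, comparing the analytic expression with a central finite difference; the features are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 4, 3
phi = rng.normal(size=(n_actions, d))   # phi[a] = feature vector phi_{sa} for one fixed state s
theta = rng.normal(size=d)

def pi(theta):
    z = phi @ theta
    p = np.exp(z - z.max())
    return p / p.sum()

a = 1
score = phi[a] - pi(theta) @ phi        # phi_sa - E_{a'~pi}[phi_sa']
eps = 1e-6
fd = np.array([(np.log(pi(theta + eps * e)[a]) - np.log(pi(theta - eps * e)[a])) / (2 * eps)
               for e in np.eye(d)])     # finite-difference gradient of log pi_theta(s, a)
print(np.allclose(score, fd, atol=1e-5))   # expected: True
```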

Gaussian policy for continuous action spaces

In continuous action spaces, it is natural to use Gaussian policies. Given state $s$, the probability (density) of action $a$ is given by:
$$\pi_\theta(s, a) = \mathcal{N}\left(\phi(s)^\top \theta,\ \sigma^2\right)$$
for some constant $\sigma$. Here $\phi(s)$ is a feature representation of $s$. Then,
$$\nabla_\theta \log\left(\pi_\theta(s, a)\right) = \nabla_\theta \frac{-\left(a - \theta^\top \phi(s)\right)^2}{2\sigma^2} = \frac{\left(a - \theta^\top \phi(s)\right)}{\sigma^2}\, \phi(s)$$
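A matching sketch for the Gaussian case, sampling an action and evaluating the score as derived above; the feature vector and constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
phi_s = rng.normal(size=d)          # phi(s) for one fixed state s (illustrative features)
theta = rng.normal(size=d)
sigma = 0.5

def sample_action(theta):
    return rng.normal(loc=phi_s @ theta, scale=sigma)

def grad_log_pi(theta, a):
    # grad_theta log N(a; theta^T phi(s), sigma^2) = (a - theta^T phi(s)) / sigma^2 * phi(s)
    return (a - phi_s @ theta) / sigma**2 * phi_s

a = sample_action(theta)
print(grad_log_pi(theta, a))
```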

Outline

1 Policy gradient methods
   - Finite horizon MDP
   - Infinite horizon case
2 Actor-critic methods
3 TRPO and PPO
4 AlphaGo Zero

Actor-critic methods

The methods we have seen so far can be divided into two categories [Konda and Tsitsiklis, 2003]:

Actor-only methods (vanilla policy gradient) work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates (by sampling trajectories). Hence, there is no "learning", in the sense of accumulation and consolidation of older information.

Actor-critic methods

Critic-only methods (e.g., Q-learning, TD-learning) rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space. A method of this type may succeed in constructing a "good" approximation of the value function, yet lack reliable guarantees in terms of near-optimality of the resulting policy.

Actor-critic methods

Actor-critic methods aim at combining the strong points of actor-only and critic-only methods, by incorporating value function approximation into policy gradient methods. We already saw the potential of using value function approximation for picking a baseline for variance reduction. Another, more obvious, place to incorporate Q-value approximation is for approximating the Q-function in the policy gradient expression. Recall, by the policy gradient theorem:
$$\nabla_\theta \rho(\pi_\theta) = \sum_s d^{\pi_\theta}(s)\, E_{a \sim \pi(s)}\left[\left(Q^{\pi_\theta}(s, a) - b^{\pi_\theta}(s)\right) \nabla_\theta \log\left(\pi_\theta(s, a)\right)\right]$$
for any baseline $b^{\pi_\theta}(\cdot)$.

Actor-critic methods

In the vanilla policy gradient algorithm, we essentially approximated the Q-value function by Monte Carlo estimation. In every iteration, we sampled multiple independent trajectories to implicitly do a Q-value estimation. Thus, as the policy changes, a new gradient is estimated independently of past estimates. In the actor-critic method, we use Q-function approximation in the gradient estimate. The actor-critic algorithm simultaneously/alternately updates the Q-function approximation as well as the policy parameters, as it sees more samples. Over iterations, as the policy changes, the Q-function approximation also improves with more sample observations.

Before describing the algorithm, let's first understand the requirements on such an approximation: what kind of Q-function approximations are desirable (at least under ideal conditions)?

Theorem 1 (Sutton et al. [1999])
If the function $f_\omega$ is compatible with the policy parameterization $\theta$ in the sense that for every $s, a$,
$$\nabla_\omega f_\omega(s, a) = \frac{1}{\pi_\theta(s, a)}\, \nabla_\theta \pi_\theta(s, a) = \nabla_\theta \log\left(\pi_\theta(s, a)\right)$$
and, further, we are given a parameter $\omega$ which is a stationary point of the following least squares problem:
$$\min_\omega\ E_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(s, \cdot)}\left[\left(Q^{\pi_\theta}(s, a) - b(s; \theta) - f_\omega(s, a)\right)^2\right]$$
where $b(\cdot; \theta)$ is any baseline, which may depend on the current policy $\pi_\theta$, then
$$\nabla_\theta \rho(\pi_\theta) = E_{s \sim d^{\pi_\theta}} E_{a \sim \pi_\theta(s)}\left[f_\omega(s, a)\, \nabla_\theta \log\left(\pi_\theta(s, a)\right)\right]$$
That is, the function approximation $f_\omega$ can be used in place of the Q-function to obtain the gradient with respect to $\theta$.

Proof of Theorem 1

(Here, we abuse notation and use $E_{s \sim d^{\pi_\theta}}[x]$ as a shorthand for $\sum_s d^{\pi_\theta}(s)\, x$. This is not technically correct in the discounted case since in that case $d^{\pi_\theta}(s) = E_{s_1}\left[\sum_{t=1}^{\infty} \Pr(s_t = s; \pi, s_1)\, \gamma^{t-1}\right]$, which is not a distribution. In fact, in the discounted case, $(1-\gamma)\, d^{\pi_\theta}$ is a distribution.)

Proof. Given $\theta$, for a stationary point $\omega$ of the least squares problem:
$$E_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(s, \cdot)}\left[\left(Q^{\pi_\theta}(s, a) - b(s; \theta) - f_\omega(s, a)\right) \nabla_\omega f_\omega(s, a)\right] = 0$$
Substituting the compatibility condition:
$$E_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(s, \cdot)}\left[\left(Q^{\pi_\theta}(s, a) - b(s; \theta) - f_\omega(s, a)\right) \nabla_\theta \pi_\theta(s, a)\, \frac{1}{\pi_\theta(s, a)}\right] = 0$$

Proof of Theorem 1

Or,
$$\sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a)\left(Q^{\pi_\theta}(s, a) - b(s; \theta) - f_\omega(s, a)\right) = 0$$
Since $b(s; \theta) \sum_a \nabla_\theta \pi_\theta(s, a) = 0$,
$$\sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a)\left(Q^{\pi_\theta}(s, a) - f_\omega(s, a)\right) = 0$$
Using this with the policy gradient theorem, we get
$$\nabla_\theta \rho(\pi_\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a)\, f_\omega(s, a)$$

Example: softmax policy

Consider a policy set parameterized by $\theta$ such that, given $s \in S$, the probability of picking action $a \in A$ is given by:
$$\pi_\theta(s, a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_{a' \in A} e^{\theta^\top \phi_{sa'}}}$$
where each $\phi_{sa}$ is an $\ell$-dimensional feature vector characterizing the state-action pair $(s, a)$. This is a popular form of parameterization. Here,
$$\nabla_\theta \pi_\theta(s, a) = \phi_{sa}\, \pi_\theta(s, a) - \left(\sum_{a' \in A} \phi_{sa'}\, \pi_\theta(s, a')\right) \pi_\theta(s, a)$$

Example: softmax policy

Meeting the compatibility condition in Theorem 1 requires that
$$\nabla_\omega f_\omega(s, a) = \frac{1}{\pi_\theta(s, a)}\, \nabla_\theta \pi_\theta(s, a) = \phi_{sa} - \sum_{a' \in A} \phi_{sa'}\, \pi_\theta(s, a')$$
A natural form of $f_\omega(s, a)$ satisfying this condition is:
$$f_\omega(s, a) = \omega^\top \left(\phi_{sa} - \sum_{b \in A} \phi_{sb}\, \pi_\theta(s, b)\right)$$
Thus $f_\omega$ must be linear in the same features as the policy, except normalized to be mean zero for each state. In this sense it is better to think of $f_\omega$ as an approximation of the advantage function, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, rather than of $Q^\pi$.

Example: Gaussian policy for continuous action spaces

In continuous action spaces, it is natural to use a Gaussian policy. Given state $s$, the probability (density) of action $a$ is given by:
$$\pi_\theta(s, a) = \mathcal{N}\left(\phi(s)^\top \theta,\ \sigma^2\right)$$
for some constant $\sigma$. Here $\phi(s)$ is a feature representation of $s$. Then the compatibility condition for $f_\omega(s, a)$ is:
$$\nabla_\omega f_\omega(s, a) = \nabla_\theta \log\left(\pi_\theta(s, a)\right) = \nabla_\theta \frac{-\left(a - \theta^\top \phi(s)\right)^2}{2\sigma^2} = \frac{\left(a - \theta^\top \phi(s)\right)}{\sigma^2}\, \phi(s)$$
For $f_\omega$ to satisfy this, it must be linear in $\omega$, e.g.,
$$f_\omega(s, a) = \frac{\left(a - \theta^\top \phi(s)\right)}{\sigma^2}\, \phi(s)^\top \omega$$

Policy iteration algorithm with function approximation

Let $f_\omega(\cdot, \cdot)$ be such that $\nabla_\omega f_\omega(s, a) = \nabla_\theta \log \pi_\theta(s, a)$ for all $\omega, \theta, s, a$.
Initialize $\theta_1$, $\pi_1 := \pi_{\theta_1}$. Pick step sizes $\alpha_1, \alpha_2, \ldots$.
In iteration $k = 1, 2, 3, \ldots$:
  - Policy evaluation: find $\omega_k = \omega$ such that
$$E_{s \sim d^{\pi_k}} E_{a \sim \pi_k(s)}\left[\left(Q^{\pi_k}(s, a) - b_k(s) - f_\omega(s, a)\right) \nabla_\theta \log\left(\pi_k(s, a)\right)\right] = 0$$
    (Here, $d^{\pi_k}$ is not normalized to 1, and sums to $1/(1-\gamma)$.)
  - Policy improvement:
$$\theta_{k+1} \leftarrow \theta_k + \alpha_k\, E_{s \sim d^{\pi_k}} E_{a \sim \pi_k(s)}\left[f_{\omega_k}(s, a)\, \nabla_\theta \log\left(\pi_k(s, a)\right)\right]$$
A similar algorithm appears in Konda and Tsitsiklis [1999].

Convergence Guarantees

The following version of convergence guarantees was provided by Sutton et al. [1999] for infinite horizon MDPs (average or discounted).

Theorem 2 (Sutton et al. [1999])
Given $\alpha_1, \alpha_2, \ldots$ such that
$$\lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k = \infty, \qquad \lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k^2 < \infty$$
and $\max_{\theta, s, a, i, j} \left|\frac{\partial^2 \pi_\theta(s, a)}{\partial \theta_i\, \partial \theta_j}\right| < \infty$. Then, for $\theta_1, \theta_2, \ldots$ obtained by the above algorithm,
$$\lim_{k \to \infty} \left.\nabla_\theta \rho(\theta)\right|_{\theta_k} = 0$$

Pseudocode: Vanilla Policy Gradient Algorithm

Algorithm 1 Vanilla Policy Gradient Algorithm
1: Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
2: for $k = 0, 1, 2, \ldots$ do
3:   Collect a set of trajectories $D_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
4:   Compute rewards-to-go $\hat{R}_t$.
5:   Compute advantage estimates $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
6:   Estimate the policy gradient as
$$\hat{g}_k = \frac{1}{|D_k|} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left.\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right|_{\theta_k} \hat{A}_t.$$
7:   Compute the policy update, either using standard gradient ascent,
$$\theta_{k+1} = \theta_k + \alpha_k\, \hat{g}_k,$$
     or via another gradient ascent algorithm like Adam.
8:   Fit the value function by regression on mean-squared error:
$$\phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left(V_\phi(s_t) - \hat{R}_t\right)^2,$$
     typically via some gradient descent algorithm.
9: end for
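For the value-function fit in step 8, a minimal sketch assuming a linear value function $V_w(s) = \mathrm{feat}(s)^\top w$ solved in closed form by ridge-regularized least squares; deep RL implementations would instead take gradient steps on a neural network, so this is only illustrative.

```python
import numpy as np

def fit_linear_value_function(state_features, rewards_to_go, reg=1e-3):
    """Least-squares fit of V_w(s) = feat(s) . w to the reward-to-go targets.
    state_features: array of shape (N, d); rewards_to_go: array of shape (N,)."""
    X, y = np.asarray(state_features), np.asarray(rewards_to_go)
    A = X.T @ X + reg * np.eye(X.shape[1])   # small ridge term for numerical stability
    return np.linalg.solve(A, X.T @ y)

# Usage: values = state_features @ w gives the baseline V(s_t) used for advantage estimates.
```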

Outline

1 Policy gradient methods
   - Finite horizon MDP
   - Infinite horizon case
2 Actor-critic methods
3 TRPO and PPO
4 AlphaGo Zero

Trust Region Policy Optimization (John Schulman, et al.)

Assume the start-state distribution $d_0$ is independent of the policy.

Total expected discounted reward of policy $\pi$:
$$\eta(\pi) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$$
Between any two different policies $\pi$ and $\tilde{\pi}$:
$$\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + E_{\tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \\
&= \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \\
&= \eta(\pi) + \sum_s d^{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a).
\end{aligned}$$

Trust Region Policy Optimization

Find a new policy $\tilde{\pi}$ to maximize $\eta(\tilde{\pi}) - \eta(\pi)$ for a given $\pi$, that is,
$$\max_{\tilde{\pi}}\ \eta(\pi) + \sum_s d^{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$
For simplicity, maximize the approximation (which replaces $d^{\tilde{\pi}}$ by $d^\pi$):
$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s d^\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$
Parameterize the policy, $\pi(a \mid s) := \pi_\theta(a \mid s)$:
$$L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) = \eta(\pi_{\theta_{\text{old}}}) + \sum_s d^{\pi_{\theta_{\text{old}}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\pi_{\theta_{\text{old}}}}(s, a)$$

Why $L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$?

A sufficiently small step $\theta_{\text{old}} \to \theta$ that improves $L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$ also improves $\eta$:
$$L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta_{\text{old}}}) = \eta(\pi_{\theta_{\text{old}}}), \qquad \left.\nabla_\theta L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\right|_{\theta=\theta_{\text{old}}} = \left.\nabla_\theta \eta(\pi_\theta)\right|_{\theta=\theta_{\text{old}}}.$$
Lower bound on the improvement of $\eta$:
$$\eta(\pi_{\theta_{\text{new}}}) \geq L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta_{\text{new}}}) - \frac{2\varepsilon\gamma}{(1-\gamma)^2}\, \alpha^2$$
where
$$\varepsilon = \max_s \left|E_{a \sim \pi_{\theta_{\text{new}}}} A_{\pi_{\theta_{\text{old}}}}(s, a)\right|, \qquad \alpha = D^{\max}_{TV}(\pi_{\theta_{\text{old}}} \| \pi_{\theta_{\text{new}}}) = \max_s D_{TV}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta_{\text{new}}}(\cdot \mid s)\right)$$

Lower bound

TV divergence between two distributions $p, q$ (discrete case):
$$D_{TV}(p \| q) = \frac{1}{2} \sum_X |p(X) - q(X)|$$
KL divergence between two distributions $p, q$ (discrete case):
$$D_{KL}(p \| q) = \sum_X p(X) \log \frac{p(X)}{q(X)}$$
$$\left(D_{TV}(p \| q)\right)^2 \leq D_{KL}(p \| q) \qquad \text{(Pollard (2000), Ch. 3)}$$
Thus we obtain a lower bound
$$\eta(\pi_{\theta_{\text{new}}}) \geq L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta_{\text{new}}}) - \frac{2\varepsilon\gamma}{(1-\gamma)^2}\, \alpha$$
where
$$\alpha = D^{\max}_{KL}(\pi_{\theta_{\text{old}}} \| \pi_{\theta_{\text{new}}}) := \max_s D_{KL}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta_{\text{new}}}(\cdot \mid s)\right)$$

Practical algorithm

The penalty coefficient $\frac{2\varepsilon\gamma}{(1-\gamma)^2}$ is large in practice, which yields small updates.

Instead, take a constraint on the KL divergence, i.e., a trust region constraint:
$$\max_\theta\ L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \qquad \text{s.t.}\ D^{\max}_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta$$
A heuristic approximation:
$$\max_\theta\ L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \qquad \text{s.t.}\ \bar{D}^{\rho_{\pi_{\theta_{\text{old}}}}}_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta$$
where
$$\bar{D}^{\rho_{\pi_{\theta_{\text{old}}}}}_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta) = E_{\pi_{\theta_{\text{old}}}}\left[D_{KL}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right]$$

TRPO

The objective and constraint are both zero when $\theta = \theta_k$. Furthermore, the gradient of the constraint with respect to $\theta$ is zero when $\theta = \theta_k$.

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around $\theta_k$:
$$L_{\theta_k}(\theta) \approx g^T (\theta - \theta_k), \qquad D_{KL}(\theta_k \| \theta) \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)$$
resulting in an approximate optimization problem,
$$\theta_{k+1} = \arg\max_\theta\ g^T (\theta - \theta_k) \qquad \text{s.t.}\ \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.$$

TRPO

By happy coincidence, the gradient $g$ of the surrogate advantage function with respect to $\theta$, evaluated at $\theta = \theta_k$, is exactly equal to the policy gradient, $\nabla_\theta J(\pi_\theta)$.

This approximate problem can be analytically solved by the methods of Lagrangian duality, yielding the solution:
$$\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g.$$
TRPO adds a modification to this update rule: a backtracking line search,
$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g,$$
where $\alpha \in (0, 1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer such that $\pi_{\theta_{k+1}}$ satisfies the KL constraint and produces a positive surrogate advantage.
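A compact NumPy sketch of this update, with the curvature supplied only through a Hessian-vector product callback `hvp(v)` and with placeholder callbacks for the surrogate improvement and the sample KL; all names and defaults here are illustrative assumptions, not the reference TRPO implementation.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()                    # residual g - H x (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta_k, g, hvp, surrogate, kl, delta=0.01, backtrack=0.8, max_backtracks=10):
    """theta_{k+1} = theta_k + backtrack^j * sqrt(2 delta / x^T H x) * x, with x ~ H^{-1} g."""
    x = conjugate_gradient(hvp, g)
    step = np.sqrt(2 * delta / (x @ hvp(x))) * x
    for j in range(max_backtracks):
        theta_new = theta_k + backtrack**j * step
        if kl(theta_new) <= delta and surrogate(theta_new) > 0:
            return theta_new
    return theta_k                  # reject the update if no step size works
```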

TRPO

Computing and storing the matrix inverse $H^{-1}$ is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve $Hx = g$ for $x = H^{-1} g$, requiring only a function that can compute the matrix-vector product $Hx$, instead of computing and storing the whole matrix $H$ directly:
$$Hx = \nabla_\theta\left(\left(\nabla_\theta D_{KL}(\theta_k \| \theta)\right)^T x\right)$$

Pseudocode: TRPO

Algorithm 2 Trust Region Policy Optimization
1: Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$, KL-divergence limit $\delta$, backtracking coefficient $\alpha$
2: for $k = 0, 1, 2, \ldots$ do
3:   Collect a set of trajectories $D_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
4:   Compute rewards-to-go $\hat{R}_t$.
5:   Compute advantage estimates $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
6:   Estimate the policy gradient as
$$\hat{g}_k = \frac{1}{|D_k|} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left.\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right|_{\theta_k} \hat{A}_t.$$
7:   Use the conjugate gradient algorithm to compute $\hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k$, where $\hat{H}_k$ is the Hessian of the sample average KL-divergence.
8:   Update the policy by backtracking line search with
$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}}\, \hat{x}_k,$$
     where $j$ is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint.
9:   Fit the value function by regression on mean-squared error:
$$\phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left(V_\phi(s_t) - \hat{R}_t\right)^2,$$
     typically via some gradient descent algorithm.
10: end for

Proximal Policy Optimization (PPO)

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old ones.

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.

PPO-Clip doesn't have a KL-divergence term in the objective and doesn't have a constraint at all. Instead it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.

Key Equations

PPO-Clip updates policies via
$$\theta_{k+1} = \arg\max_\theta\ E_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right],$$
typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here $L$ is given by
$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ \operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\ 1 - \epsilon,\ 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right),$$
in which $\epsilon$ is a (small) hyperparameter which roughly says how far away the new policy is allowed to go from the old.

This is a pretty complex expression, and it's hard to tell at first glance what it's doing, or how it helps keep the new policy close to the old policy. As it turns out, there's a considerably simplified version of this objective which is a bit easier to grapple with:
$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ g\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right)\right),$$
where
$$g(\epsilon, A) = \begin{cases} (1+\epsilon) A & A \geq 0 \\ (1-\epsilon) A & A < 0. \end{cases}$$
To figure out what intuition to take away from this, let's look at a single state-action pair $(s, a)$, and think of cases.
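A one-function sketch of the clipped objective over a batch of samples (NumPy; the log-probability inputs and the default $\epsilon$ are illustrative).

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Batch-averaged PPO-Clip objective L(s, a, theta_k, theta).
    logp_new, logp_old: log pi_theta(a|s) and log pi_theta_k(a|s); adv: advantage estimates."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```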

Pseudocode: PPO

Algorithm 3 PPO-Clip
1: Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
2: for $k = 0, 1, 2, \ldots$ do
3:   Collect a set of trajectories $D_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.
4:   Compute rewards-to-go $\hat{R}_t$.
5:   Compute advantage estimates $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
6:   Update the policy by maximizing the PPO-Clip objective:
$$\theta_{k+1} = \arg\max_\theta \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\, A^{\pi_{\theta_k}}(s_t, a_t),\ g\left(\epsilon, A^{\pi_{\theta_k}}(s_t, a_t)\right)\right),$$
     typically via stochastic gradient ascent with Adam.
7:   Fit the value function by regression on mean-squared error:
$$\phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left(V_\phi(s_t) - \hat{R}_t\right)^2,$$
     typically via some gradient descent algorithm.
8: end for

Outline

1 Policy gradient methods
   - Finite horizon MDP
   - Infinite horizon case
2 Actor-critic methods
3 TRPO and PPO
4 AlphaGo Zero

Monte-Carlo Tree Search (MCTS)

Selection: Starting at the root node, a child selection policy is recursively applied to descend through the tree until the most urgent expandable node is reached. A node is expandable if it represents a nonterminal state and has unvisited (i.e., unexpanded) children.

Expansion: One (or more) child nodes are added to expand the tree, according to the available actions.

Simulation: A simulation is run from the new node(s) according to the default policy to produce an outcome.

Backpropagation: The simulation result is "backed up" (i.e., backpropagated) through the selected nodes to update their statistics.

Main steps of MCTS

(Figure: the four MCTS phases, from Cameron Browne, Edward Powley, Daniel Whitehouse, et al., "A Survey of Monte Carlo Tree Search Methods".)

Upper Confidence Bounds for Trees (UCT)

Every time a node (action) is to be selected within the existing tree, the choice may be modelled as an independent multi-armed bandit problem. A child node $j$ is selected to maximize
$$UCT = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}$$
where $n$ is the number of times the current (parent) node has been visited, $n_j$ the number of times child $j$ has been visited, and $C_p > 0$ is a constant.

The values of $X_{i,t}$, and thus of $\bar{X}_j$, are understood to lie within $[0, 1]$.

Each node $v$ has four pieces of data associated with it: the associated state $s(v)$, the incoming action $a(v)$, the total simulation reward $Q(v)$ (a vector of real values), and the visit count $N(v)$ (a nonnegative integer).
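A minimal sketch of the BestChild selection rule with nodes stored as simple Python objects, treating the simulation reward as a scalar; unvisited children ($N(v') = 0$) are assumed to be handled by the expansion step before this rule is applied.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object
    action: object = None          # incoming action a(v)
    q: float = 0.0                 # total simulation reward Q(v), scalar here
    n: int = 0                     # visit count N(v)
    children: list = field(default_factory=list)

def best_child(v, c):
    """argmax over children of Q(v')/N(v') + c * sqrt(2 ln N(v) / N(v'))."""
    return max(v.children,
               key=lambda w: w.q / w.n + c * math.sqrt(2 * math.log(v.n) / w.n))
```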

Upper Confidence Bounds for Trees (UCT)

Algorithm 4 UCT
Function UCTSearch(s0):
  create root node v0 with state s0
  while within computational budget do
    vl <- TreePolicy(v0)
    Delta <- DefaultPolicy(s(vl))
    Backup(vl, Delta)
  end while
  return a(BestChild(v0, 0))

Function TreePolicy(v):
  while v is nonterminal do
    if v is not fully expanded then
      return Expand(v)
    else
      v <- BestChild(v, Cp)
    end if
  end while
  return v

Function Expand(v):
  choose a from the untried actions in A(s(v))
  add a new child v' to v with s(v') = f(s(v), a) and a(v') = a
  return v'

Algorithm 5 UCT (continued)
Function BestChild(v, c):
  return argmax over children v' of v of Q(v')/N(v') + c * sqrt(2 ln N(v) / N(v'))

Function DefaultPolicy(s):
  while s is nonterminal do
    choose a from A(s) uniformly at random
    s <- f(s, a)
  end while
  return reward for state s

Function Backup(v, Delta):
  while v is not null do
    N(v) <- N(v) + 1
    Q(v) <- Q(v) + Delta(v, p)
    v <- parent of v
  end while

AlphaGo Zero

(The remaining slides consist of figures illustrating AlphaGo Zero; the images are not included in this transcript.)