
Page 1: Sequential Decision Making

Sequential Decision Making
Lecture 5: Beyond Value-Based Methods

Emilie Kaufmann

Ecole Centrale Lille, 2021/2022

Page 2: Sequential Decision Making

Reminder

Until now we have seen value-based methods, that learn Q(s, a), an estimate of the optimal Q-value function

Q*(s, a) = max_π Q^π(s, a) = max_π E_π[ Σ_{t=1}^∞ γ^{t-1} r(s_t, a_t) | s_1 = s, a_1 = a ]

→ our guess for the optimal policy is then π = greedy(Q):

π(s) = argmax_{a∈A} Q(s, a)

(a deterministic policy)

Page 3: Sequential Decision Making

Outline

1 Optimizing Over Policies

2 Policy Gradients

3 The REINFORCE algorithm

4 Advantage Actor Critic


Page 4: Sequential Decision Making

Optimizing over policies?

We could try to solve

argmax_{π∈Π} E_π[ Σ_{t=1}^∞ γ^{t-1} r(s_t, a_t) | s_1 ∼ ρ ]

where Π = {stationary, deterministic policies π : S → A} and ρ is a distribution over initial states.

→ intractable!

Idea: relax this optimization problem by searching over a (smoothly) parameterized set of stochastic policies.

Page 5: Sequential Decision Making

A new objective

▶ parametric family of stochastic policies {π_θ}_{θ∈Θ}

▶ π_θ(a|s): probability of choosing action a in state s, given θ

▶ θ ↦ π_θ(a|s) is assumed to be differentiable

Goal: find θ that maximizes

J(θ) = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} r(s_t, a_t) | s_1 ∼ ρ ]

over the parameter space Θ.

Idea: use gradient ascent (a minimal sketch follows below).

→ How to compute the gradient ∇_θ J(θ)?

→ How to estimate it using trajectories?
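To make the idea concrete, here is a minimal sketch of the gradient-ascent loop in Python, assuming a hypothetical estimator grad_J_estimate(theta) that returns an estimate of ∇_θ J(θ) from sampled trajectories (such estimators are derived in the following slides):

    import numpy as np

    def policy_gradient_ascent(grad_J_estimate, theta0, step_size=0.01, n_iters=1000):
        # plain gradient ascent on J: theta <- theta + alpha * (estimate of grad J)
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iters):
            theta = theta + step_size * grad_J_estimate(theta)
        return theta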

Page 6: Sequential Decision Making

Warm-up: Computing gradients

▶ f : X → R is a (possibly non-differentiable) function

▶ {p_θ}_{θ∈Θ} is a family of probability distributions over X

J(θ) = E_{X∼p_θ}[ f(X) ]

Proposition

∇_θ J(θ) = E_{X∼p_θ}[ f(X) ∇_θ log p_θ(X) ]

Exercise: prove it!
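The proof is left as the exercise; as a quick sanity check only, here is a small numerical illustration of the proposition, assuming X ∼ N(θ, σ²) and f(x) = x², so that E[f(X)] = θ² + σ² and the exact gradient is 2θ:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma, n = 1.5, 2.0, 1_000_000

    x = rng.normal(theta, sigma, size=n)            # samples X ~ p_theta
    f = x ** 2                                      # f(X)
    score = (x - theta) / sigma ** 2                # grad_theta log p_theta(X)

    print("score-function estimate:", np.mean(f * score))   # close to 3.0
    print("exact gradient         :", 2 * theta)            # 3.0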

Page 7: Sequential Decision Making

Outline

1 Optimizing Over Policies

2 Policy Gradients

3 The REINFORCE algorithm

4 Advantage Actor Critic


Page 8: Sequential Decision Making

Finite-Horizon objective

J(θ) = E_{π_θ}[ Σ_{t=1}^T γ^{t-1} r(s_t, a_t) | s_1 ∼ ρ ]   for some γ ∈ (0, 1].

▶ τ = (s_1, a_1, s_2, a_2, . . . , s_T, a_T): a trajectory of length T

▶ π_θ induces a distribution p_θ over trajectories:

p_θ(τ) = ρ(s_1) ∏_{t=1}^T π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)

▶ cumulative discounted reward over the trajectory:

R(τ) := Σ_{t=1}^T γ^{t-1} r(s_t, a_t)

so that the objective rewrites as J(θ) = E_{τ∼p_θ}[ R(τ) ].
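As a small helper (an illustrative sketch, not part of the slides), the return R(τ) of a recorded trajectory can be computed from its reward sequence; rewards[0] corresponds to t = 1 in the slides' indexing:

    def discounted_return(rewards, gamma):
        # R(tau) = sum_{t=1}^T gamma^{t-1} r_t, written with 0-based Python indexing
        return sum(gamma ** t * r for t, r in enumerate(rewards))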


Page 10: Sequential Decision Making

Computing the gradient

Applying the previous proposition to the trajectory distribution p_θ,

∇_θ J(θ) = E_{τ∼p_θ}[ R(τ) ∇_θ log p_θ(τ) ]

and

∇_θ log p_θ(τ) = ∇_θ log( ρ(s_1) ∏_{t=1}^T π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t) )

               = ∇_θ [ log ρ(s_1) + Σ_{t=1}^T ( log π_θ(a_t|s_t) + log p(s_{t+1}|s_t, a_t) ) ]

               = Σ_{t=1}^T ∇_θ log π_θ(a_t|s_t)

since only the policy terms depend on θ. Hence,

∇_θ J(θ) = E_{τ∼p_θ}[ Σ_{t=1}^T R(τ) ∇_θ log π_θ(a_t|s_t) ]

Page 11: Sequential Decision Making

The baseline trick

∇_θ J(θ) = E_{τ∼p_θ}[ Σ_{t=1}^T R(τ) ∇_θ log π_θ(a_t|s_t) ]

In each step t, we may subtract a baseline function b_t(s_1, a_1, . . . , s_t), which depends only on the beginning of the trajectory (up to s_t), i.e.

∇_θ J(θ) = E_{τ∼p_θ}[ Σ_{t=1}^T ( R(τ) − b_t(s_1, a_1, . . . , s_t) ) ∇_θ log π_θ(a_t|s_t) ]

Why? Conditionally on (s_1, a_1, . . . , s_t),

E_{τ∼p_θ}[ b_t(s_1, a_1, . . . , s_t) ∇_θ log π_θ(a_t|s_t) | s_1, a_1, . . . , s_t ]

  = b_t(s_1, a_1, . . . , s_t) Σ_{a∈A} π_θ(a|s_t) ∇_θ log π_θ(a|s_t)

  = b_t(s_1, a_1, . . . , s_t) Σ_{a∈A} ∇_θ π_θ(a|s_t)

  = b_t(s_1, a_1, . . . , s_t) ∇_θ ( Σ_{a∈A} π_θ(a|s_t) )   [the sum equals 1]

  = 0

Page 12: Sequential Decision Making

Choosing a baseline

∇_θ J(θ) = E_{τ∼p_θ}[ Σ_{t=1}^T ( R(τ) − b_t(s_1, a_1, . . . , s_t) ) ∇_θ log π_θ(a_t|s_t) ]

A common choice is

b_t(s_1, a_1, . . . , s_t) = Σ_{i=1}^{t-1} γ^{i-1} r(s_i, a_i)

(the rewards collected before step t), which leads to

R(τ) − b_t(s_1, a_1, . . . , s_t) = Σ_{i=t}^T γ^{i-1} r(s_i, a_i)

                                  = γ^{t-1} Σ_{i=t}^T γ^{i-t} r(s_i, a_i)

where the last sum is the discounted sum of rewards starting from s_t (the "reward to go").
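A short sketch of computing all reward-to-go terms G_t = Σ_{i=t}^T γ^{i-t} r(s_i, a_i) of a recorded trajectory with a backward recursion (function and variable names are mine):

    import numpy as np

    def rewards_to_go(rewards, gamma):
        # G_t = r_t + gamma * G_{t+1}, computed backwards over the trajectory
        G = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            G[t] = running
        return G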

Page 13: Sequential Decision Making

Policy Gradient Theorem

Using this baseline, we obtain

∇_θ J(θ) = E_{τ∼p_θ}[ Σ_{t=1}^T γ^{t-1} ( Σ_{i=t}^T γ^{i-t} r(s_i, a_i) ) ∇_θ log π_θ(a_t|s_t) ]

         = E_{π_θ}[ Σ_{t=1}^T γ^{t-1} Q_t^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

where

Q_t^π(s, a) = E_π[ Σ_{i=t}^T γ^{i-t} r(s_i, a_i) | s_t = s, a_t = a ]

Page 14: Sequential Decision Making

Policy Gradient Theorem: Infinite Horizon

J(θ) = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} r(s_t, a_t) | s_1 ∼ ρ ]

(taking the limit T → ∞ of the previous objective)

Policy Gradient Theorem [Sutton et al., 1999]

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} Q^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

where Q^π(s, a) is the usual Q-value function of policy π.

Remark: this is sometimes written as

∇_θ J(θ) = (1/(1−γ)) E_{(s,a)∼d^π}[ Q^{π_θ}(s, a) ∇_θ log π_θ(a|s) ]

with d^π(s, a) = (1−γ) Σ_{t=1}^∞ γ^{t-1} P_π(S_t = s, A_t = a).

Page 15: Sequential Decision Making

Outline

1 Optimizing Over Policies

2 Policy Gradients

3 The REINFORCE algorithm

4 Advantage Actor Critic


Page 16: Sequential Decision Making

Recap: Exact gradients

▶ Finite horizon

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^T γ^{t-1} Q_t^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

▶ Infinite horizon

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} Q^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

→ simple formulations from which to build unbiased estimates of the gradients based on trajectories (almost unbiased for the infinite horizon, where trajectories must be truncated)

Page 17: Sequential Decision Making

REINFORCE

▶ Initialize θ arbitrarily

▶ In each step, generate N trajectories of length T under π_θ,

(s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, . . . , s_T^{(i)}, a_T^{(i)}, r_T^{(i)})_{i=1,...,N}

and compute a Monte-Carlo estimate of the gradient,

ĝ = (1/N) Σ_{i=1}^N Σ_{t=1}^T γ^{t-1} G_t^{(i)} ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)})

with G_t^{(i)} = Σ_{s=t}^T γ^{s-t} r_s^{(i)}.

▶ Update θ ← θ + α ĝ

(one may use N = 1, and T large enough so that γ^T/(1−γ) is small; a code sketch follows below)
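Below is a minimal sketch of REINFORCE with N = 1 for a tiny tabular MDP and a softmax policy π_θ(a|s) ∝ exp(θ[s, a]); the two-state environment and all helper names are made-up illustrations, not taken from the lecture:

    import numpy as np

    n_states, n_actions, gamma, T = 2, 2, 0.9, 50
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.8, 0.2], [0.2, 0.8]]])    # P[s, a, s']: transition kernel
    r = np.array([[0.0, 0.1], [1.0, 0.0]])      # r[s, a]: reward function

    def policy(theta, s):                       # softmax over theta[s, :]
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    def grad_log_pi(theta, s, a):               # grad_theta log pi_theta(a|s), tabular softmax
        g = np.zeros_like(theta)
        g[s] = -policy(theta, s)
        g[s, a] += 1.0
        return g

    rng = np.random.default_rng(0)
    theta, alpha = np.zeros((n_states, n_actions)), 0.1
    for _ in range(2000):
        s, traj = 0, []
        for _ in range(T):                      # one trajectory under pi_theta (N = 1)
            a = rng.choice(n_actions, p=policy(theta, s))
            traj.append((s, a, r[s, a]))
            s = rng.choice(n_states, p=P[s, a])
        G, grad = 0.0, np.zeros_like(theta)
        for t in reversed(range(T)):            # accumulate gamma^{t-1} G_t grad log pi terms
            s_t, a_t, r_t = traj[t]
            G = r_t + gamma * G                 # reward to go G_t
            grad += gamma ** t * G * grad_log_pi(theta, s_t, a_t)  # gamma**t = gamma^{t-1} (0-based t)
        theta += alpha * grad                   # gradient ascent step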


Page 19: Sequential Decision Making

Choosing the policy class

A common choice when A is finite is a softmax policy:

∀a ∈ A,   π_θ(a|s) = exp(κ f_θ(s, a)) / Σ_{a'∈A} exp(κ f_θ(s, a'))

▶ if S is finite, one may use f_θ(s, a) = θ_{s,a}, with Θ = R^{S×A}

▶ otherwise, f_θ(s, a) is a function from some parametric class (e.g. a neural network)

The score function is then

∇_θ log π_θ(a|s) = κ ∇_θ f_θ(s, a) − κ Σ_{a'∈A} π_θ(a'|s) ∇_θ f_θ(s, a')
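As an illustration, the score formula can be checked numerically for a linear f_θ(s, a) = θ · φ(s, a) at one fixed state; the feature map φ and all dimensions below are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_actions, kappa = 5, 3, 2.0
    phi = rng.normal(size=(n_actions, d))       # phi(s, a) for one fixed state s
    theta = rng.normal(size=d)

    def pi(theta):
        z = np.exp(kappa * phi @ theta)
        return z / z.sum()

    def score(theta, a):
        # kappa * grad f(s, a) - kappa * sum_a' pi(a'|s) grad f(s, a'), with grad f = phi(s, a)
        return kappa * phi[a] - kappa * pi(theta) @ phi

    a, eps = 1, 1e-6
    num = np.array([(np.log(pi(theta + eps * e)[a]) - np.log(pi(theta - eps * e)[a])) / (2 * eps)
                    for e in np.eye(d)])        # finite-difference gradient
    print(np.allclose(score(theta, a), num, atol=1e-5))   # prints True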

Page 20: Sequential Decision Making

Choosing the policy class

Policy gradient algorithms can also handle continuous action spaces. For example, we may use a Gaussian policy with density

π_θ(a|s) = (1/√(2π σ_{θ2}²(s))) exp( −(a − μ_{θ1}(s))² / (2 σ_{θ2}²(s)) )

where θ = (θ1, θ2) parameterizes the mean and the standard deviation. Then

∇_{θ1} log π_θ(a|s) = ((a − μ_{θ1}(s)) / σ_{θ2}²(s)) ∇_{θ1} μ_{θ1}(s)

∇_{θ2} log π_θ(a|s) = (((a − μ_{θ1}(s))² − σ_{θ2}²(s)) / σ_{θ2}³(s)) ∇_{θ2} σ_{θ2}(s)
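A quick numerical check of these two formulas in the simplest setting where μ_{θ1}(s) = θ1 and σ_{θ2}(s) = θ2 (so both inner gradients equal 1); this scalar parameterization is assumed only for illustration:

    import numpy as np

    def log_pi(a, t1, t2):
        # log of the Gaussian density with mean t1 and standard deviation t2
        return -0.5 * np.log(2 * np.pi * t2 ** 2) - (a - t1) ** 2 / (2 * t2 ** 2)

    a, t1, t2, eps = 0.7, 0.2, 1.3, 1e-6
    score_mu = (a - t1) / t2 ** 2                      # grad_{theta1} log pi
    score_sigma = ((a - t1) ** 2 - t2 ** 2) / t2 ** 3  # grad_{theta2} log pi

    num_mu = (log_pi(a, t1 + eps, t2) - log_pi(a, t1 - eps, t2)) / (2 * eps)
    num_sigma = (log_pi(a, t1, t2 + eps) - log_pi(a, t1, t2 - eps)) / (2 * eps)
    print(np.allclose([score_mu, score_sigma], [num_mu, num_sigma]))   # prints True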

Page 21: Sequential Decision Making

Limitation

The gradient estimated by REINFORCE can have a large variance.

Two ideas to overcome this problem:

▶ use better baselines

▶ use a different estimate of Q^{π_θ}(s, a) (which will introduce some bias)

Page 22: Sequential Decision Making

Outline

1 Optimizing Over Policies

2 Policy Gradients

3 The REINFORCE algorithm

4 Advantage Actor Critic


Page 23: Sequential Decision Making

Baseline trick, reloaded

One can further subtract the baseline b_t(s_1, a_1, . . . , s_t) = V^{π_θ}(s_t):

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} Q^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

         = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} ( Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t) ) ∇_θ log π_θ(a_t|s_t) ]

         = E_{π_θ}[ Σ_{t=1}^∞ γ^{t-1} A^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]

introducing the advantage function

A^π(s, a) = Q^π(s, a) − V^π(s)

          = Q^π(s, a) − E_{a'∼π(·|s)}[ Q^π(s, a') ]

(how much better is it to take action a first, rather than an action drawn from π, and then follow π?)

Page 24: Sequential Decision Making

Estimating the advantage

▶ Assume we have access to V̂, an estimate of V^{π_θ}

▶ The advantage at (s_t, a_t) can be estimated from the next transition by

Â(s_t, a_t) = r_t + γ V̂(s_{t+1}) − V̂(s_t)

or from more transitions,

Â(s_t, a_t) = Σ_{k=t}^{t+p} γ^{k-t} r_k + γ^{p+1} V̂(s_{t+p+1}) − V̂(s_t)

▶ This leads to a gradient estimator from (multiple) trajectories:

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T γ^{t-1} Â(s_t^{(i)}, a_t^{(i)}) ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)})

→ How do we produce the estimate V̂? Use a critic. (A sketch of the p-step advantage estimate follows below.)
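A sketch of the p-step advantage estimate along one recorded trajectory, assuming value estimates V̂(s_t) are available as an array (all names are placeholders):

    import numpy as np

    def n_step_advantages(rewards, values, gamma, p):
        # rewards[t] = r_t for t = 0..T-1, values[t] = estimate of V(s_t) for t = 0..T (length T+1)
        T = len(rewards)
        adv = np.zeros(T)
        for t in range(T):
            horizon = min(t + p, T - 1)                    # truncate near the end of the trajectory
            ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon + 1))
            ret += gamma ** (horizon + 1 - t) * values[horizon + 1]   # bootstrap term
            adv[t] = ret - values[t]
        return adv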


Page 26: Sequential Decision Making

Actor-critic algorithms

▶ Actor: maintains a policy and generates trajectories under it

▶ Critic: maintains a value function, which estimates the value of the policy followed by the actor

Rationale:

▶ the actor improves its policy using the value estimated by the critic

▶ the critic uses the trajectories generated by the actor to update its estimate of that value

→ Generalized Policy Iteration

Both the actor and the critic can use a parametric representation:

▶ π_θ: the actor's policy, θ ∈ Θ

▶ V_ω: the critic's value function, ω ∈ Ω

Page 27: Sequential Decision Making

How to update the critic?

▶ Idea 1: use TD(0)

After each observed transition (s_t, a_t, r_t, s_{t+1}) under π_θ,

δ_t = r_t + γ V_ω(s_{t+1}) − V_ω(s_t)

ω ← ω + α δ_t ∇_ω V_ω(s_t)

▶ Idea 2: use batches and bootstrapping

V̂(s_t^{(i)}) = Σ_{k=t}^{t+p} γ^{k-t} r_k^{(i)} + γ^{p+1} V_ω(s_{t+p+1}^{(i)})

and minimize the loss with respect to ω:

(1/N) Σ_{i=1}^N Σ_{t=1}^T ( V̂(s_t^{(i)}) − V_ω(s_t^{(i)}) )²

(a sketch of the TD(0) update follows below)
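A sketch of the TD(0) update for a linear critic V_ω(s) = ω · φ(s); the linear form and the feature map phi are assumptions for illustration:

    import numpy as np

    def td0_update(omega, phi, s, r, s_next, gamma, alpha):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  grad_omega V(s_t) = phi(s_t)
        delta = r + gamma * phi(s_next) @ omega - phi(s) @ omega
        return omega + alpha * delta * phi(s)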

Page 28: Sequential Decision Making

The A2C algorithm [Mnih et al., 2016]

In each iteration:

▶ collect M transitions (s_k, a_k, r_k, s_{k+1})_{k∈[M]} under the policy π_θ (resetting to an initial state whenever a terminal state is reached)

▶ compute the (bootstrapped) Monte-Carlo estimates

V̂(s_k) = Q̂(s_k, a_k) = Σ_{t=k}^{min(τ_k, M)} γ^{t-k} r_t + γ^{M-k+1} V_ω(s_{M+1}) 1(τ_k > M)

where τ_k is the first terminal time after k, and the advantage estimates Â(s_k, a_k) = Q̂(s_k, a_k) − V_ω(s_k)

▶ one gradient step to decrease the policy loss, θ ← θ − α ∇_θ L_π(θ), with

L_π(θ) = −(1/M) Σ_{k=1}^M Â(s_k, a_k) log π_θ(a_k|s_k) − (λ/M) Σ_{k=1}^M Σ_a π_θ(a|s_k) log( 1/π_θ(a|s_k) )

(the second term is an entropy bonus with coefficient λ > 0, which encourages exploration)

▶ one gradient step to decrease the value loss, ω ← ω − α ∇_ω L_V(ω), with

L_V(ω) = (1/M) Σ_{k=1}^M ( V̂(s_k) − V_ω(s_k) )²

(a sketch of the two losses follows below)
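A sketch of the two A2C losses computed on one batch of M transitions, assuming the per-transition quantities have already been extracted from the rollout (array names and the entropy coefficient lam are placeholders):

    import numpy as np

    def a2c_losses(advantages, log_probs, entropies, targets, values, lam=0.01):
        # advantages[k] = advantage estimate at (s_k, a_k), log_probs[k] = log pi_theta(a_k|s_k),
        # entropies[k] = entropy of pi_theta(.|s_k), targets[k] = bootstrapped return, values[k] = V_omega(s_k)
        policy_loss = -np.mean(advantages * log_probs) - lam * np.mean(entropies)
        value_loss = np.mean((targets - values) ** 2)
        return policy_loss, value_loss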

Page 29: Sequential Decision Making

Policy Gradient Algorithms: Pros and Cons

+ allows conservative policy updates (not just taking an argmax), which makes learning more stable

+ easy to implement, and can handle continuous state and action spaces

+ the use of randomized policies provides some exploration...

− ... but not always enough

− requires a lot of samples

− controlling the variance of the gradient estimate can be hard (many tricks exist for variance reduction)

− the objective J(θ) is not concave, so how do we avoid local maxima?

Page 30: Sequential Decision Making

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML).

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS).