TRANSCRIPT
REINFORCEMENT LEARNING
❐ The Problem
❐ Reinforcement Learning Methods
 Monte-Carlo learning
 Temporal-difference learning
 TD(λ)
❐ Function approximation
1
David Silver
1. THE PROBLEM
❐ What is reinforcement learning?
Learn from experience to …
❐ Fly stunt manoeuvres in a helicopter
❐ Defeat the world champion at Backgammon/Checkers/Go
❐ Manage an investment portfolio
❐ Control a power station
❐ Walk a robot through an obstacle course
5
The Agent-Environment Interface
Agent and environment interact at discrete time steps: t = 0, 1, 2, …
 Agent observes state at step t: s_t ∈ S
 produces action at step t: a_t ∈ A(s_t)
 gets resulting reward: r_{t+1} ∈ ℝ
 and resulting next state: s_{t+1}
[Diagram: the interaction loop … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → …]
Textbook
❐ Sutton & Barto, “Reinforcement Learning: An Introduction”
❐ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
6
Slides adapted from: http://webdocs.cs.ualberta.ca/~sutton/RLslides.tgz
7
The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities
 π_t(s, a) = probability that a_t = a when s_t = s
❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.
8

Returns

The return R_t is the total reward from time t onwards.
Episodic tasks: interaction breaks naturally into terminating episodes, e.g., games, trips through a maze.
 R_t = r_{t+1} + r_{t+2} + … + r_T
Continuing tasks: interaction does not have natural episodes.
 R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
where γ, 0 ≤ γ ≤ 1, is the discount rate:
 shortsighted 0 ← γ → 1 farsighted
9
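The two return definitions above can be sketched directly in Python (an illustrative sketch; the function names are not from the slides):

```python
def episodic_return(rewards):
    """Undiscounted return R_t = r_{t+1} + r_{t+2} + ... + r_T."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

With γ close to 0 only the next reward matters; with γ close to 1 distant rewards count almost as much as immediate ones.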
Example 1: Cart Pole
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task, where the episode ends upon failure:
 reward = +1 for each step before failure
 ⇒ return = number of steps before failure
As a continuing task with discounted return:
 reward = −1 upon failure; 0 otherwise
 ⇒ return = −γ^k, for k steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
10
Example 2: Computer Go
Play Go against a human opponent
The computer player is the agent
The human opponent is the environment
As an episodic task where each game is an episode:
 reward = +1 if computer player wins, 0 if human wins, 0 for every non-terminal move
 ⇒ return = win/lose game
Return is maximized by beating the opponent.
11
The Markov Property
❐ The state is the information available to the agent at step t
❐ The state can be immediate observations, or can be built up over time from sequences of observations
❐ Ideally, a state should summarize the information contained in all past observations
❐ This is known as the Markov Property:
 Pr(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0)
  = Pr(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)
12
Markov Decision Processes
❐ If a reinforcement learning task has the Markov Property, it is a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP.
❐ To define a finite MDP, you need to give:
 state and action sets
 one-step “dynamics” defined by transition probabilities:
  P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } for all s, s′ ∈ S, a ∈ A(s)
 expected rewards:
  R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ } for all s, s′ ∈ S, a ∈ A(s)
13
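A finite MDP’s one-step dynamics can be written down as plain data. The sketch below is an illustrative encoding (the dictionary layout and state names are assumptions, not from the slides): P[s][a] is a list of (probability, next state, expected reward) triples, so P^a_{ss′} is recovered by summing probabilities.

```python
# Illustrative two-state finite MDP: P[s][a] -> [(prob, next_state, reward), ...]
P = {
    "s0": {"a": [(0.7, "s0", 0.0), (0.3, "s1", 1.0)],
           "b": [(1.0, "s1", 0.5)]},
    "s1": {"a": [(1.0, "s1", 0.0)]},
}

def transition_prob(s, a, s2):
    """P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }."""
    return sum(p for p, nxt, _ in P[s][a] if nxt == s2)

print(transition_prob("s0", "a", "s1"))  # 0.3
```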
Value Functions
❐ The value of a state is the expected return starting from that state; it depends on the agent’s policy.
 State-value function for policy π:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
❐ The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
 Action-value function for policy π:
  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
14
Gridworld
❐ Actions: north, south, east, west; deterministic.
❐ If an action would take the agent off the grid: no move, but reward = −1
❐ Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy; γ = 0.9
15
Optimal Value Functions
❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them by π*.
❐ Optimal policies share the same optimal state-value function:
 V*(s) = max_π V^π(s) for all s ∈ S
❐ Optimal policies also share the same optimal action-value function:
 Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
This is the expected return for taking action a in state s and thereafter following an optimal policy.
16
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld:
[Figure: V* and an optimal policy π* for the gridworld]
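The one-step-ahead search above can be sketched as follows (assuming the same kind of (probability, next state, reward) model encoding used earlier; all names are illustrative):

```python
def greedy_action(P, V, s, gamma):
    """Return the action maximising sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma * V(s')]."""
    def backup(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return max(P[s], key=backup)

# Tiny illustrative model: "right" yields immediate reward 1, "left" yields 0.
P = {"s0": {"left": [(1.0, "s1", 0.0)], "right": [(1.0, "s2", 1.0)]}}
V = {"s1": 0.0, "s2": 0.0}
print(greedy_action(P, V, "s0", 0.9))  # right
```

Note that acting greedily with respect to V* requires the model (P and R); acting greedily with respect to Q* does not, which is one reason action-value methods are attractive.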
Summary
❐ Many important control problems can be formalised as reinforcement learning problems
❐ The agent’s goal is to maximise its return (long-term reward) given its experience
❐ The value function summarises the long-term consequences of following a policy
❐ There is a unique optimal value function, shared by all optimal policies
❐ The policy evaluation problem is to find V^π(s)
❐ The control problem is to find V*(s)
17
2. MONTE-CARLO METHODS
❐ Learning from sample episodes
19
Short History of Monte-Carlo Methods
❐ Developed in Los Alamos during World War II
 by John von Neumann and Stanislaw Ulam
 Central to the success of the Manhattan Project
❐ Solve massive problems with no analytical solution
❐ Using random sampling instead of exhaustive computation
❐ John von Neumann chose Monte-Carlo as the codename
20
Monte-Carlo Evaluation
❐ Estimates the value of policy π, V^π(s)
❐ Learns from complete episodes of experience
❐ Model-free: don’t need to know the MDP dynamics
❐ Sample-based: samples state transitions at random, rather than exhaustively searching all possible state transitions
21
Monte-Carlo Evaluation
❐ Monte-Carlo evaluation is very simple:
 Remember all returns from each state s
 V(s) is the average return starting from state s and following π
❐ V(s) converges to V^π(s) if s is visited infinitely often
22

Incremental Averaging

❐ Don’t have to remember all returns:
 V(s_t) ← V(s_t) + α_t [R_t − V(s_t)]
❐ If α_t = 1/N(s_t) then this computes the average return from s_t
❐ If α is constant then this computes a recent average return
 Useful for tracking non-stationary values
23
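The incremental-averaging update above can be sketched as (illustrative names, not from the slides):

```python
from collections import defaultdict

V = defaultdict(float)   # value estimates, V(s)
N = defaultdict(int)     # visit counts, N(s)

def mc_update(s, ret):
    """V(s) <- V(s) + alpha_t [R_t - V(s)] with alpha_t = 1/N(s)."""
    N[s] += 1
    V[s] += (ret - V[s]) / N[s]

for ret in [2.0, 4.0, 6.0]:
    mc_update("s0", ret)
print(V["s0"])  # 4.0 -- the average of the three observed returns
```

Replacing `1 / N[s]` with a small constant α turns the running mean into an exponentially-recency-weighted mean, which tracks non-stationary values.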
Monte-Carlo in Blackjack
❐ Object: have your card sum be greater than the dealer’s without exceeding 21.
❐ States (200 of them):
 player’s sum
 dealer’s showing card
 useable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (no more cards), twist (receive another card)
❐ Policy: stick if my sum is 20 or 21, else twist
24
Blackjack value functions
25
Monte Carlo Estimation of Action Values (Q)
❐ Q(s, a) is the average return starting from state s and action a, following π
❐ Q(s, a) converges to Q^π(s, a) if every state-action pair is visited infinitely many times
❐ e.g., use an ε-greedy policy:
 Select the greedy action with probability 1 − ε
 Select a random action with probability ε
26
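An ε-greedy selection rule as described above might look like this (a sketch; the dictionary layout of Q and the action names are assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Greedy action with probability 1 - epsilon, random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

Q = {("s0", "stick"): 1.0, ("s0", "twist"): 0.5}
action = epsilon_greedy(Q, "s0", ["stick", "twist"], 0.1)
```

Every state-action pair keeps a nonzero probability of being tried, which is what the "visited infinitely many times" condition requires.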
Monte-Carlo Control
❐ Evaluation: evaluate the current policy using MC simulation
❐ Improvement: update the policy to act greedily with respect to the action-value function
❐ Converges on the optimal policy, if ε is slowly reduced to 0
27
Blackjack example continued
Summary
❐ Monte-Carlo evaluation is the simplest policy evaluation method:
 Model-free
 Sample-based
 Complete episodes (no bootstrapping)
❐ Monte-Carlo control is the simplest control method:
 Monte-Carlo evaluation
 ε-greedy policy improvement
❐ Simple and effective solution to many problems
❐ Good convergence properties
28
3. TEMPORAL-DIFFERENCE LEARNING
❐ Learning a guess from a guess
30
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π

Recall the simple every-visit Monte-Carlo method:
 V(s_t) ← V(s_t) + α [R_t − V(s_t)]
 target: the actual return after time t

The simplest TD method, TD(0):
 V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
 target: an estimate of the return; the update is fully incremental
31
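The TD(0) backup above can be sketched as a one-line update (illustrative sketch; terminal states are given value 0 by convention):

```python
from collections import defaultdict

def td0_update(V, s, r, s2, alpha, gamma, terminal=False):
    """V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]; V of a terminal state is 0."""
    target = r if terminal else r + gamma * V[s2]
    V[s] += alpha * (target - V[s])

V = defaultdict(float)
td0_update(V, "s0", 1.0, "s1", alpha=0.5, gamma=0.9)
print(V["s0"])  # 0.5
```

Unlike the MC update, this can run after every single transition, before the episode's outcome is known.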
Monte-Carlo Simulation
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
where R_t is the actual return following state s_t.
[Backup diagram: a complete sampled episode from s_t down to a terminal state T]
32
Temporal-Difference Learning
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
[Backup diagram: a single sampled transition s_t → r_{t+1}, s_{t+1}; the rest of the episode is replaced by the estimate V(s_{t+1})]
33
cf. Full-width backups
V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
[Backup diagram: from s_t, every action and every successor state is considered exhaustively]
34
Advantages of TD Learning
❐ TD methods do not require a model of the environment, only experience
❐ TD, but not MC, methods can be fully incremental:
 You can learn before knowing the final outcome
 You can learn without the final outcome, from incomplete sequences
❐ TD reduces variance but increases bias:
 MC depends on long sequences of random variables
 TD depends on only one step
❐ Both MC and TD converge to the correct values
35
TD and MC on the Random Walk
36
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
37
Sarsa: On-Policy TD Control
Policy evaluation: temporal-difference learning
Policy improvement: ε-greedy
38
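Putting evaluation and improvement together, one Sarsa episode might be sketched as below. The environment interface (reset() → state, step(a) → (next state, reward, done)) is an assumption for illustration, not part of the slides:

```python
import random
from collections import defaultdict

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Run one episode of Sarsa, updating Q in place."""
    def pick(s):
        # epsilon-greedy policy improvement
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = pick(s2)
        # Q of a terminal state-action pair is 0
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
```

Sarsa is on-policy: the bootstrap target uses the action a' actually chosen by the current (ε-greedy) policy.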
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
39
Results of Sarsa on the Windy Gridworld
40
Q-Learning: Off-Policy TD Control
One-step Q-learning:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
41
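The one-step Q-learning backup can be sketched in the same style (illustrative; note the target uses the max over actions, not the action actually taken, which is what makes it off-policy):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha [r + gamma max_b Q(s',b) - Q(s,a)]."""
    target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
Q[("s1", "left")], Q[("s1", "right")] = 0.0, 2.0
q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.5, gamma=0.5)
print(Q[("s0", "left")])  # 1.0
```

Because the target maximises over next actions, Q-learning learns the greedy (optimal) policy even while behaviour is ε-greedy, which explains the cliff-walking difference from Sarsa.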
Cliffwalking
ε−greedy, ε = 0.1
Summary
❐ Temporal-difference learning is:
 Model-free
 Sample-based
 Bootstrapping (learns from incomplete episodes)
❐ Can use TD for control:
 On-policy (SARSA)
 Off-policy (Q-learning)
❐ TD is lower variance but higher bias than MC
❐ Usually more efficient in practice
42
4. TD(λ)
❐ Between MC and TD
44
n-step TD Prediction
❐ Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
45
n-Step TD Learning on Random Walks
❐ Task: 19 state random walk
❐ Optimal n depends on task
46
TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
 weight by λ^{n−1} (time since visitation)
 λ-return:
  R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}
❐ Backup using λ-return:
  ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]
47
λ-Return Weighting Function
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
 (first term: until termination; second term: after termination)
48
TD(λ) on the Random Walk
❐ Same 19 state random walk as before
49
Relation to TD(0) and MC
❐ λ-return can be rewritten as:
 R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
 (first term: until termination; second term: after termination)
❐ If λ = 1, you get MC:
 R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t
❐ If λ = 0, you get TD(0):
 R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
50
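The terminating-episode λ-return formula can be sketched as below (illustrative; the n-step returns R_t^{(n)} and the Monte-Carlo return R_t are assumed to be precomputed and passed in):

```python
def lambda_return(n_step_returns, mc_return, lam):
    """R_t^lam = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) R_t^(n) + lam^(T-t-1) R_t."""
    horizon = len(n_step_returns)  # number of n-step returns before termination
    weighted = (1 - lam) * sum(
        (lam ** (n - 1)) * g for n, g in enumerate(n_step_returns, start=1))
    return weighted + (lam ** horizon) * mc_return

# lam = 0 keeps only the one-step return; lam = 1 recovers the MC return.
print(lambda_return([1.0, 2.0, 3.0], 4.0, 0.0))  # 1.0
print(lambda_return([1.0, 2.0, 3.0], 4.0, 1.0))  # 4.0
```

Intermediate values of λ interpolate between the TD(0) and MC extremes, matching the unified view on the next slide.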
Unified View
5. FUNCTION APPROXIMATION
❐ RL in large state spaces
52
Function Approximation
❐ Can’t store the value of every state:
 Too many states / continuous state space
 Limited memory
 Too slow to learn individual values / no generalisation
❐ Use function approximation to approximate the value function:
 Using a small number of parameters
 Generalising between similar states
❐ Adjust parameters by gradient descent (or a similar idea)
V_t(s) = f(s, θ_t)
53
Value Prediction with FA
As usual, Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π

Value function approximation:
 V_t(s) = f(s, θ_t)
Parameters:
 θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T  (T denotes transpose)
e.g., θ_t could be the vector of connection weights of a neural network.
54
Monte-Carlo Training Examples
Monte-Carlo backup: V(s_t) ← V(s_t) + α_t [R_t − V(s_t)]
As a training example:
 { description of s_t, R_t }
  input, target output
55
TD Training Examples
e.g., the TD(0) backup: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
As a training example:
 { description of s_t, r_{t+1} + γ V(s_{t+1}) }
  input, target output
56
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:
 ∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T
[Figure: gradient-descent steps in a 2-D parameter space, θ_t = (θ_t(1), θ_t(2))^T]
Iteratively move down the gradient:
 θ_{t+1} = θ_t − α ∇_θ f(θ_t)
57
Linear Methods
Represent states as feature vectors:
 for each s ∈ S: φ_s = (φ_s(1), φ_s(2), …, φ_s(n))^T
The value function is linear in the features:
 V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} θ_t(i) φ_s(i)
The gradient is then simply the feature vector:
 ∇_θ V_t(s) = φ_s
58

Linear Temporal-Difference Learning

Update parameters in the direction that reduces the TD-error:
 V(s) = φ(s)^T θ
 δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
 Δθ = α δ_t φ(s_t)
59
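The three linear-TD equations can be sketched without any libraries (an illustrative sketch; the one-hot feature vectors and names are assumptions):

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_td_update(theta, phi_s, phi_s2, r, alpha, gamma):
    """theta <- theta + alpha * delta * phi(s), with delta the TD-error."""
    delta = r + gamma * dot(phi_s2, theta) - dot(phi_s, theta)
    return [th + alpha * delta * f for th, f in zip(theta, phi_s)]

theta = [0.0, 0.0]
phi = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}  # one-hot (tabular) features
theta = linear_td_update(theta, phi["s0"], phi["s1"], r=1.0, alpha=0.5, gamma=0.9)
print(theta)  # [0.5, 0.0]
```

With one-hot features this reduces exactly to tabular TD(0); with coarse or tile-coded features the same update generalises across nearby states.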
Coarse Coding
60
Tile Coding
❐ Binary feature for each tile
❐ Number of features present at any one time is constant
❐ Binary features mean the weighted sum is easy to compute
❐ Easy to compute the indices of the features present
61
Mountain-Car Task
62
Mountain Car with Radial Basis Functions
Convergence
❐ Monte-Carlo learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
❐ Temporal-difference learning converges with:
 On-policy learning (e.g. not Q-learning)
 Linear function approximation
❐ Gradient temporal-difference learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
 Sutton et al. (ICML 2009)
63
SUMMARY
❐ Reinforcement learning: learn from experience to achieve more reward
❐ The problem can be reduced to:
 Policy evaluation (prediction)
 Control
❐ Model-free solution methods:
 Monte-Carlo simulation and Monte-Carlo control
 Temporal-difference learning, SARSA, Q-learning
 TD(λ) provides a spectrum between MC and TD
❐ Function approximation: scales MC and TD up to large state spaces
64
QUESTIONS
❐ “The only stupid question is the one you never asked” – Richard Sutton
65