TRANSCRIPT
REINFORCEMENT LEARNING
❐ The Problem
❐ Reinforcement Learning Methods
 Monte-Carlo learning
 Temporal-difference learning
 TD(λ)
❐ Function approximation
1
David Silver
1. THE PROBLEM
❐ What is reinforcement learning?
Learn from experience to …
❐ Fly stunt manoeuvres in a helicopter
❐ Defeat the world champion at Backgammon/Checkers/Go
❐ Manage an investment portfolio
❐ Control a power station
❐ Walk a robot through an obstacle course
5
The Agent-Environment Interface
Agent and environment interact at discrete time steps: t = 0, 1, 2, …
 Agent observes state at step t: s_t ∈ S
 produces action at step t: a_t ∈ A(s_t)
 gets resulting reward: r_{t+1} ∈ ℝ
 and resulting next state: s_{t+1}
[Diagram: the interaction loop … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → …]
Textbook
❐ Sutton & Barto, “Reinforcement Learning: An Introduction”
❐ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
6
Slides adapted from: http://webdocs.cs.ualberta.ca/~sutton/RLslides.tgz
7
The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities
 π_t(s, a) = probability that a_t = a when s_t = s
❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.
8

Returns

The return R_t is the total reward from time t onwards.
Episodic tasks: interaction breaks naturally into terminating episodes, e.g., games, trips through a maze.
 R_t = r_{t+1} + r_{t+2} + … + r_T
Continuing tasks: interaction does not have natural episodes.
 R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
where γ, 0 ≤ γ ≤ 1, is the discount rate:
 shortsighted 0 ← γ → 1 farsighted
9
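The two return definitions above can be sketched directly in Python (an illustrative sketch; the function names are not from the slides):

```python
def episodic_return(rewards):
    """Undiscounted return R_t = r_{t+1} + r_{t+2} + ... + r_T."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

With γ close to 0 only the next reward matters; with γ close to 1 distant rewards count almost as much as immediate ones.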
Example 1: Cart Pole
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task, where the episode ends upon failure:
 reward = +1 for each step before failure
 ⇒ return = number of steps before failure
As a continuing task with discounted return:
 reward = −1 upon failure; 0 otherwise
 ⇒ return = −γ^k, for k steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
10
Example 2: Computer Go
Play Go against a human opponent
The computer player is the agent
The human opponent is the environment
As an episodic task where each game is an episode:
 reward = +1 if computer player wins, 0 if human wins, 0 for every non-terminal move
 ⇒ return = win/lose game
Return is maximized by beating the opponent.
11
The Markov Property
❐ The state is the information available to the agent at step t
❐ The state can be immediate observations, or can be built up over time from sequences of observations
❐ Ideally, a state should summarize the information contained in all past observations
❐ This is known as the Markov Property:
 Pr(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0)
  = Pr(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)
12
Markov Decision Processes
❐ If a reinforcement learning task has the Markov Property, it is a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP.
❐ To define a finite MDP, you need to give:
 state and action sets
 one-step “dynamics” defined by transition probabilities:
  P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } for all s, s′ ∈ S, a ∈ A(s)
 expected rewards:
  R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ } for all s, s′ ∈ S, a ∈ A(s)
13
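A finite MDP’s one-step dynamics can be written down as plain data. The sketch below is an illustrative encoding (the dictionary layout and state names are assumptions, not from the slides): P[s][a] is a list of (probability, next state, expected reward) triples, so P^a_{ss′} is recovered by summing probabilities.

```python
# Illustrative two-state finite MDP: P[s][a] -> [(prob, next_state, reward), ...]
P = {
    "s0": {"a": [(0.7, "s0", 0.0), (0.3, "s1", 1.0)],
           "b": [(1.0, "s1", 0.5)]},
    "s1": {"a": [(1.0, "s1", 0.0)]},
}

def transition_prob(s, a, s2):
    """P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }."""
    return sum(p for p, nxt, _ in P[s][a] if nxt == s2)

print(transition_prob("s0", "a", "s1"))  # 0.3
```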
Value Functions
❐ The value of a state is the expected return starting from that state; it depends on the agent’s policy.
 State-value function for policy π:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
❐ The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
 Action-value function for policy π:
  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
14
Gridworld
❐ Actions: north, south, east, west; deterministic.
❐ If an action would take the agent off the grid: no move, but reward = −1
❐ Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy; γ = 0.9
15
Optimal Value Functions
❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them by π*.
❐ Optimal policies share the same optimal state-value function:
 V*(s) = max_π V^π(s) for all s ∈ S
❐ Optimal policies also share the same optimal action-value function:
 Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
This is the expected return for taking action a in state s and thereafter following an optimal policy.
16
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld:
[Figure: V* and an optimal policy π* for the gridworld]
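The one-step-ahead search above can be sketched as follows (assuming the same kind of (probability, next state, reward) model encoding used earlier; all names are illustrative):

```python
def greedy_action(P, V, s, gamma):
    """Return the action maximising sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma * V(s')]."""
    def backup(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return max(P[s], key=backup)

# Tiny illustrative model: "right" yields immediate reward 1, "left" yields 0.
P = {"s0": {"left": [(1.0, "s1", 0.0)], "right": [(1.0, "s2", 1.0)]}}
V = {"s1": 0.0, "s2": 0.0}
print(greedy_action(P, V, "s0", 0.9))  # right
```

Note that acting greedily with respect to V* requires the model (P and R); acting greedily with respect to Q* does not, which is one reason action-value methods are attractive.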
Summary
❐ Many important control problems can be formalised as reinforcement learning problems
❐ The agent’s goal is to maximise its return (long-term reward) given its experience
❐ The value function summarises the long-term consequences of following a policy
❐ There is a unique optimal value function, shared by all optimal policies
❐ The policy evaluation problem is to find V^π(s)
❐ The control problem is to find V*(s)
17
2. MONTE-CARLO METHODS
❐ Learning from sample episodes
19
Short History of Monte-Carlo Methods
❐ Developed in Los Alamos during World War II
 by John von Neumann and Stanislaw Ulam
 Central to the success of the Manhattan Project
❐ Solve massive problems with no analytical solution
❐ Using random sampling instead of exhaustive computation
❐ John von Neumann chose Monte-Carlo as the codename
20
Monte-Carlo Evaluation
❐ Estimates the value of policy π, V^π(s)
❐ Learns from complete episodes of experience
❐ Model-free: don’t need to know the MDP dynamics
❐ Sample-based: samples state transitions at random, rather than exhaustively searching all possible state transitions
21
Monte-Carlo Evaluation
❐ Monte-Carlo evaluation is very simple:
 Remember all returns from each state s
 V(s) is the average return starting from state s and following π
❐ V(s) converges to V^π(s) if s is visited infinitely often
22

Incremental Averaging

❐ Don’t have to remember all returns:
 V(s_t) ← V(s_t) + α_t [R_t − V(s_t)]
❐ If α_t = 1/N(s_t) then this computes the average return from s_t
❐ If α is constant then this computes a recent average return
 Useful for tracking non-stationary values
23
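The incremental-averaging update above can be sketched as (illustrative names, not from the slides):

```python
from collections import defaultdict

V = defaultdict(float)   # value estimates, V(s)
N = defaultdict(int)     # visit counts, N(s)

def mc_update(s, ret):
    """V(s) <- V(s) + alpha_t [R_t - V(s)] with alpha_t = 1/N(s)."""
    N[s] += 1
    V[s] += (ret - V[s]) / N[s]

for ret in [2.0, 4.0, 6.0]:
    mc_update("s0", ret)
print(V["s0"])  # 4.0 -- the average of the three observed returns
```

Replacing `1 / N[s]` with a small constant α turns the running mean into an exponentially-recency-weighted mean, which tracks non-stationary values.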
Monte-Carlo in Blackjack
❐ Object: have your card sum be greater than the dealer’s without exceeding 21.
❐ States (200 of them):
 player’s sum
 dealer’s showing card
 useable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (no more cards), twist (receive another card)
❐ Policy: stick if my sum is 20 or 21, else twist
24
Blackjack value functions
25
Monte Carlo Estimation of Action Values (Q)
❐ Q(s, a) is the average return starting from state s and action a, following π
❐ Q(s, a) converges to Q^π(s, a) if every state-action pair is visited infinitely many times
❐ e.g., use an ε-greedy policy:
 Select the greedy action with probability 1 − ε
 Select a random action with probability ε
26
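An ε-greedy selection rule as described above might look like this (a sketch; the dictionary layout of Q and the action names are assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Greedy action with probability 1 - epsilon, random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

Q = {("s0", "stick"): 1.0, ("s0", "twist"): 0.5}
action = epsilon_greedy(Q, "s0", ["stick", "twist"], 0.1)
```

Every state-action pair keeps a nonzero probability of being tried, which is what the "visited infinitely many times" condition requires.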
Monte-Carlo Control
❐ Evaluation: evaluate the current policy using MC simulation
❐ Improvement: update the policy to act greedily with respect to the action-value function
❐ Converges on the optimal policy, if ε is slowly reduced to 0
27
Blackjack example continued
Summary
❐ Monte-Carlo evaluation is the simplest policy evaluation method:
 Model-free
 Sample-based
 Complete episodes (no bootstrapping)
❐ Monte-Carlo control is the simplest control method:
 Monte-Carlo evaluation
 ε-greedy policy improvement
❐ Simple and effective solution to many problems
❐ Good convergence properties
28
3. TEMPORAL-DIFFERENCE LEARNING
❐ Learning a guess from a guess
30
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π

Recall the simple every-visit Monte-Carlo method:
 V(s_t) ← V(s_t) + α [R_t − V(s_t)]
 target: the actual return after time t

The simplest TD method, TD(0):
 V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
 target: an estimate of the return; the update is fully incremental
31
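The TD(0) backup above can be sketched as a one-line update (illustrative sketch; terminal states are given value 0 by convention):

```python
from collections import defaultdict

def td0_update(V, s, r, s2, alpha, gamma, terminal=False):
    """V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]; V of a terminal state is 0."""
    target = r if terminal else r + gamma * V[s2]
    V[s] += alpha * (target - V[s])

V = defaultdict(float)
td0_update(V, "s0", 1.0, "s1", alpha=0.5, gamma=0.9)
print(V["s0"])  # 0.5
```

Unlike the MC update, this can run after every single transition, before the episode's outcome is known.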
Monte-Carlo Simulation
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
where R_t is the actual return following state s_t.
[Backup diagram: a complete sampled episode from s_t down to a terminal state T]
32
Temporal-Difference Learning
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
[Backup diagram: a single sampled transition s_t → r_{t+1}, s_{t+1}; the rest of the episode is replaced by the estimate V(s_{t+1})]
33
cf. Full-width backups
V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
[Backup diagram: from s_t, every action and every successor state is considered exhaustively]
34
Advantages of TD Learning
❐ TD methods do not require a model of the environment, only experience
❐ TD, but not MC, methods can be fully incremental:
 You can learn before knowing the final outcome
 You can learn without the final outcome, from incomplete sequences
❐ TD reduces variance but increases bias:
 MC depends on long sequences of random variables
 TD depends on only one step
❐ Both MC and TD converge to the correct values
35
TD and MC on the Random Walk
36
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
37
Sarsa: On-Policy TD Control
Policy evaluation: temporal-difference learning
Policy improvement: ε-greedy
38
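Putting evaluation and improvement together, one Sarsa episode might be sketched as below. The environment interface (reset() → state, step(a) → (next state, reward, done)) is an assumption for illustration, not part of the slides:

```python
import random
from collections import defaultdict

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Run one episode of Sarsa, updating Q in place."""
    def pick(s):
        # epsilon-greedy policy improvement
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = pick(s2)
        # Q of a terminal state-action pair is 0
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
```

Sarsa is on-policy: the bootstrap target uses the action a' actually chosen by the current (ε-greedy) policy.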
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
39
Results of Sarsa on the Windy Gridworld
40
Q-Learning: Off-Policy TD Control
One-step Q-learning:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
41
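The one-step Q-learning backup can be sketched in the same style (illustrative; note the target uses the max over actions, not the action actually taken, which is what makes it off-policy):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha [r + gamma max_b Q(s',b) - Q(s,a)]."""
    target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
Q[("s1", "left")], Q[("s1", "right")] = 0.0, 2.0
q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.5, gamma=0.5)
print(Q[("s0", "left")])  # 1.0
```

Because the target maximises over next actions, Q-learning learns the greedy (optimal) policy even while behaviour is ε-greedy, which explains the cliff-walking difference from Sarsa.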
Cliffwalking
ε−greedy, ε = 0.1
Summary
❐ Temporal-difference learning is:
 Model-free
 Sample-based
 Bootstrapping (learns from incomplete episodes)
❐ Can use TD for control:
 On-policy (SARSA)
 Off-policy (Q-learning)
❐ TD is lower variance but higher bias than MC
❐ Usually more efficient in practice
42
4. TD(λ)
❐ Between MC and TD
44
n-step TD Prediction
❐ Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
45
n-Step TD Learning on Random Walks
❐ Task: 19 state random walk
❐ Optimal n depends on task
46
TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
 weight by λ^{n−1} (time since visitation)
 λ-return:
  R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}
❐ Backup using λ-return:
  ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]
47
λ-Return Weighting Function
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
 (first term: until termination; second term: after termination)
48
TD(λ) on the Random Walk
❐ Same 19 state random walk as before
49
Relation to TD(0) and MC
❐ λ-return can be rewritten as:
 R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
 (first term: until termination; second term: after termination)
❐ If λ = 1, you get MC:
 R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t
❐ If λ = 0, you get TD(0):
 R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
50
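The terminating-episode λ-return formula can be sketched as below (illustrative; the n-step returns R_t^{(n)} and the Monte-Carlo return R_t are assumed to be precomputed and passed in):

```python
def lambda_return(n_step_returns, mc_return, lam):
    """R_t^lam = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) R_t^(n) + lam^(T-t-1) R_t."""
    horizon = len(n_step_returns)  # number of n-step returns before termination
    weighted = (1 - lam) * sum(
        (lam ** (n - 1)) * g for n, g in enumerate(n_step_returns, start=1))
    return weighted + (lam ** horizon) * mc_return

# lam = 0 keeps only the one-step return; lam = 1 recovers the MC return.
print(lambda_return([1.0, 2.0, 3.0], 4.0, 0.0))  # 1.0
print(lambda_return([1.0, 2.0, 3.0], 4.0, 1.0))  # 4.0
```

Intermediate values of λ interpolate between the TD(0) and MC extremes, matching the unified view on the next slide.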
Unified View
5. FUNCTION APPROXIMATION
❐ RL in large state spaces
52
Function Approximation
❐ Can’t store the value of every state:
 Too many states / continuous state space
 Limited memory
 Too slow to learn individual values / no generalisation
❐ Use function approximation to approximate the value function:
 Using a small number of parameters
 Generalising between similar states
❐ Adjust parameters by gradient descent (or a similar idea)
V_t(s) = f(s, θ_t)
53
Value Prediction with FA
As usual, Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π

Value function approximation:
 V_t(s) = f(s, θ_t)
Parameters:
 θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T  (T denotes transpose)
e.g., θ_t could be the vector of connection weights of a neural network.
54
Monte-Carlo Training Examples
Monte-Carlo backup: V(s_t) ← V(s_t) + α_t [R_t − V(s_t)]
As a training example:
 { description of s_t, R_t }
  input, target output
55
TD Training Examples
e.g., the TD(0) backup: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
As a training example:
 { description of s_t, r_{t+1} + γ V(s_{t+1}) }
  input, target output
56
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:
 ∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T
[Figure: gradient-descent steps in a 2-D parameter space, θ_t = (θ_t(1), θ_t(2))^T]
Iteratively move down the gradient:
 θ_{t+1} = θ_t − α ∇_θ f(θ_t)
57
Linear Methods
Represent states as feature vectors:
 for each s ∈ S: φ_s = (φ_s(1), φ_s(2), …, φ_s(n))^T
The value function is linear in the features:
 V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} θ_t(i) φ_s(i)
The gradient is then simply the feature vector:
 ∇_θ V_t(s) = φ_s
58

Linear Temporal-Difference Learning

Update parameters in the direction that reduces the TD-error:
 V(s) = φ(s)^T θ
 δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
 Δθ = α δ_t φ(s_t)
59
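The three linear-TD equations can be sketched without any libraries (an illustrative sketch; the one-hot feature vectors and names are assumptions):

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_td_update(theta, phi_s, phi_s2, r, alpha, gamma):
    """theta <- theta + alpha * delta * phi(s), with delta the TD-error."""
    delta = r + gamma * dot(phi_s2, theta) - dot(phi_s, theta)
    return [th + alpha * delta * f for th, f in zip(theta, phi_s)]

theta = [0.0, 0.0]
phi = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}  # one-hot (tabular) features
theta = linear_td_update(theta, phi["s0"], phi["s1"], r=1.0, alpha=0.5, gamma=0.9)
print(theta)  # [0.5, 0.0]
```

With one-hot features this reduces exactly to tabular TD(0); with coarse or tile-coded features the same update generalises across nearby states.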
Coarse Coding
60
Tile Coding
❐ Binary feature for each tile
❐ Number of features present at any one time is constant
❐ Binary features mean the weighted sum is easy to compute
❐ Easy to compute the indices of the features present
61
Mountain-Car Task
62
Mountain Car with Radial Basis Functions
Convergence
❐ Monte-Carlo learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
❐ Temporal-difference learning converges with:
 On-policy learning (e.g. not Q-learning)
 Linear function approximation
❐ Gradient temporal-difference learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
 Sutton et al. (ICML 2009)
63
SUMMARY
❐ Reinforcement learning: learn from experience to achieve more reward
❐ The problem can be reduced to:
 Policy evaluation (prediction)
 Control
❐ Model-free solution methods:
 Monte-Carlo simulation and Monte-Carlo control
 Temporal-difference learning, SARSA, Q-learning
 TD(λ) provides a spectrum between MC and TD
❐ Function approximation: scales MC and TD up to large state spaces
64
QUESTIONS
❐ “The only stupid question is the one you never asked” – Richard Sutton
65