![Page 1: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/1.jpg)
A Closer Look at Function Approximation
Robert Platt, Northeastern University
![Page 2: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/2.jpg)
The problem of large and continuous state spaces
Example of a large state space: Atari Learning Environment
– state: video game screen
– actions: joystick actions
– reward: game score
[Figure: agent-environment loop. The agent takes actions (a) and perceives states and rewards (s, r).]
Why are large state spaces a problem for tabular methods?
1. many states may never be visited
2. there is no notion that the agent should behave similarly in “similar” states.
![Page 3: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/3.jpg)
Function approximation
Approximating the value function using a function approximator: $\hat v(s, \mathbf{w}) \approx v_\pi(s)$
Here $\hat v$ is some kind of function approximator parameterized by $\mathbf{w}$.
![Page 4: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/4.jpg)
Which Function Approximator?
There are many function approximators, e.g.
– Linear combinations of features
– Neural networks
– Decision trees
– Nearest neighbour
– Fourier / wavelet bases
We will require the function approximator to be differentiable
Need to be able to handle non-stationary, non-iid data
![Page 6: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/6.jpg)
Approximating value function using SGD
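For starters, let’s focus on policy evaluation, i.e. estimating $v_\pi$.
Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value fn, $\hat v(s, \mathbf{w})$, and the true value function, $v_\pi(s)$:
$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat v(S, \mathbf{w}))^2\big]$$
Approach: do gradient descent on this cost function.
Here’s the gradient:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\,\mathbb{E}_\pi\big[(v_\pi(S) - \hat v(S, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat v(S, \mathbf{w})\big]$$
so a stochastic gradient step with step size $\alpha$ on a sampled state $S$ is $\Delta\mathbf{w} = \alpha\,(v_\pi(S) - \hat v(S, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat v(S, \mathbf{w})$ (the factor of 2 is absorbed into $\alpha$).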
![Page 7: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/7.jpg)
Linear value function approximation
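Let’s approximate $\hat v$ as a linear function of features:
$$\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{n} w_i\, x_i(s)$$
where x(s) is the feature vector:
$$\mathbf{x}(s) = \big(x_1(s), x_2(s), \dots, x_n(s)\big)^\top$$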
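A minimal sketch of a linear value function in Python. The polynomial feature map `poly_features` is a hypothetical example chosen for illustration; the slides do not prescribe any particular features.

```python
def poly_features(state):
    """Hypothetical feature vector x(s) for a scalar state: (1, s, s^2)."""
    return [1.0, state, state * state]


def v_hat(state, weights, features):
    """Linear value estimate: the dot product of weights and x(s)."""
    x = features(state)
    return sum(w_i * x_i for w_i, x_i in zip(weights, x))


def grad_v_hat(state, weights, features):
    """For a linear approximator, the gradient w.r.t. the weights is x(s)."""
    return features(state)


weights = [1.0, 0.5, 0.25]
estimate = v_hat(2.0, weights, poly_features)
```

Because the gradient is just the feature vector, every gradient-based update for the linear case reduces to scaling x(s) by an error term.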
![Page 8: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/8.jpg)
Think-pair-share
Can you think of some good features for pacman?
![Page 10: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/10.jpg)
Linear value function approx: coarse coding
For example, the elements of x(s) could correspond to regions of state space:
Binary features: one feature for each circle (above)
The value function is encoded by the combination of all regions (circles) that contain the state
![Page 11: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/11.jpg)
The effect of overlapping feature regions
![Page 12: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/12.jpg)
Think-pair-share
What type of linear features might be appropriate for this problem?
What is the relationship between feature shape and generalization?
[Figure labels: cliff region, goal region]
![Page 13: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/13.jpg)
Linear value function approx: tile coding
For example, x(s) could be constructed using tile coding:
– Each tiling is a partition of the state space.
– Each tiling assigns each state to a unique tile.
Binary features: n = num tiles × num tilings. In this example: n = 16 × 4 = 64.
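The construction can be sketched in Python as follows. This assumes a 2-D state in the unit square, 4 tilings of 4 × 4 tiles each, and evenly spaced diagonal offsets; the function names and the offset scheme are illustrative choices, not taken from the slides.

```python
def active_tiles(state, num_tilings=4, tiles_per_dim=4):
    """Return the index of the single active binary feature in each tiling.

    state: (x, y) with both coordinates in [0, 1).
    Tiling t is shifted by t/num_tilings of one tile width along both axes.
    Coordinates pushed past the last tile are clamped into it, so each
    tiling still partitions the unit square into tiles_per_dim**2 tiles.
    """
    x, y = state
    tile_width = 1.0 / tiles_per_dim
    tiles_per_tiling = tiles_per_dim * tiles_per_dim
    indices = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings
        col = min(int((x + offset) / tile_width), tiles_per_dim - 1)
        row = min(int((y + offset) / tile_width), tiles_per_dim - 1)
        indices.append(t * tiles_per_tiling + row * tiles_per_dim + col)
    return indices


def value(state, weights):
    """Linear value estimate: sum the weights of the active tiles."""
    return sum(weights[i] for i in active_tiles(state))


weights = [0.0] * 64  # n = 16 tiles x 4 tilings
tiles = active_tiles((0.1, 0.1))
```

With binary features the dot product collapses to a sum over the active tiles, which is why tile coding is cheap to evaluate: exactly one weight per tiling is touched.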
![Page 14: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/14.jpg)
Think-pair-share
The value function is encoded by the combination of all tiles that a state intersects
State aggregation is a special case of tile coding.
How many tilings in this case?
What do the weights correspond to in this case?
Binary features: n = num tiles × num tilings. In this example: n = 16 × 4.
![Page 15: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/15.jpg)
Think-pair-share
– what are the pros/cons of rectangular tiles like this?
– what are the pros/cons of evenly spacing the tilings vs placing them at uneven offsets?
Binary features: n = num tiles × num tilings. In this example: n = 16 × 4.
![Page 16: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/16.jpg)
Recall the Monte Carlo policy evaluation algorithm
Let’s think about how to do the same thing using function approximation...
![Page 18: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/18.jpg)
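Gradient Monte Carlo policy evaluation
Goal: calculate $\hat v(\cdot, \mathbf{w}) \approx v_\pi$
Notice that in MC, the return $G_t$ is an unbiased, noisy sample of the true value, $v_\pi(S_t)$.
Can therefore apply supervised learning to “training data”: $\langle S_1, G_1\rangle, \langle S_2, G_2\rangle, \dots, \langle S_T, G_T\rangle$
The weight update “sampled” from the training data is:
$$\Delta\mathbf{w} = \alpha\,\big(G_t - \hat v(S_t, \mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})$$
For a linear function approximator, this is:
$$\Delta\mathbf{w} = \alpha\,\big(G_t - \hat v(S_t, \mathbf{w})\big)\,\mathbf{x}(S_t)$$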
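A rough sketch of the linear gradient MC update applied to one recorded episode. The episode format (a list of state/reward pairs) and the function names are assumptions made for this example. Returns are accumulated backwards for convenience, so states are updated in reverse time order.

```python
def gradient_mc_episode(weights, episode, features, alpha=0.1, gamma=1.0):
    """Apply one linear gradient Monte Carlo update per visited state.

    episode: list of (state, reward) pairs, where reward is the reward
    received after leaving that state. weights is updated in place.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                    # return from this state
        x = features(state)
        v = sum(w_i * x_i for w_i, x_i in zip(weights, x))
        error = G - v                             # sampled target minus estimate
        for i, x_i in enumerate(x):
            weights[i] += alpha * error * x_i     # gradient of v_hat is x(s)
    return weights


def one_hot(state, n=2):
    """Hypothetical tabular features for states 0..n-1."""
    return [1.0 if i == state else 0.0 for i in range(n)]


w = gradient_mc_episode([0.0, 0.0], [(0, 0.0), (1, 1.0)], one_hot, alpha=0.5)
```

In this toy episode both states see a return of 1, so each weight moves halfway from 0 towards 1.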
![Page 19: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/19.jpg)
Gradient Monte Carlo policy evaluation
For linear function approximation, gradient MC converges to the weights that minimize MSE with respect to the true value function (given a suitably decreasing step size).
Even for non-linear function approximation, gradient MC converges to a local optimum.
However, since this is MC, the estimates are high-variance.
![Page 20: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/20.jpg)
Gradient MC example: 1000-state random walk
![Page 21: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/21.jpg)
Gradient MC example: 1000-state random walk
The whole value function over 1000 states will be approximated with 10 numbers!
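To make the "10 numbers" concrete, here is a small sketch of state aggregation. It assumes the states are numbered 1 to 1000 and grouped into 10 consecutive blocks of 100; the figure that would confirm this grouping is not available in this text.

```python
NUM_STATES = 1000
NUM_GROUPS = 10
GROUP_SIZE = NUM_STATES // NUM_GROUPS


def group_of(state):
    """Map a state in 1..1000 to its group index in 0..9."""
    return (state - 1) // GROUP_SIZE


def v_hat(state, weights):
    """Every state in a group shares that group's single weight."""
    return weights[group_of(state)]


weights = [0.0] * NUM_GROUPS  # the 10 numbers that encode the value function
```

Under this scheme an update at any one state shifts the estimate for all 100 states in its group.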
![Page 22: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/22.jpg)
Question
The whole value function over 1000 states will be approximated with 10 numbers!
How many tilings are here?
![Page 23: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/23.jpg)
Gradient MC example: 1000-state random walk
![Page 24: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/24.jpg)
Gradient MC example: 1000-state random walk
Converges to unbiased value estimate
![Page 25: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/25.jpg)
Question
What is the relationship between the state distribution (mu) and the policy?
How do you correct for following a policy that visits states differently?
![Page 28: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/28.jpg)
TD Learning with value function approximation
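The TD target, $R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w})$, is an estimate of the true value, $v_\pi(S_t)$, and a biased one because it depends on the current weights.
But, let’s ignore that and use the TD target anyway…
Training data: $\langle S_1, R_2 + \gamma \hat v(S_2, \mathbf{w})\rangle, \langle S_2, R_3 + \gamma \hat v(S_3, \mathbf{w})\rangle, \dots, \langle S_{T-1}, R_T\rangle$
This gives us TD(0) policy evaluation with:
$$\Delta\mathbf{w} = \alpha\,\big(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})$$
where $S_{t+1}$ is the next state.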
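A minimal sketch of a single semi-gradient TD(0) step for a linear approximator. The function name and argument layout are assumptions for illustration. The value of a terminal next state is taken to be zero.

```python
def td0_update(weights, x, reward, x_next, alpha=0.1, gamma=0.9, terminal=False):
    """Return new weights after one linear semi-gradient TD(0) step.

    x and x_next are the feature vectors of the current and next state.
    """
    v = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    v_next = 0.0 if terminal else sum(w_i * x_i for w_i, x_i in zip(weights, x_next))
    td_error = reward + gamma * v_next - v
    # Only the gradient of v_hat at the current state is used; the
    # dependence of the target on the weights is ignored.
    return [w_i + alpha * td_error * x_i for w_i, x_i in zip(weights, x)]


new_weights = td0_update([0.0, 1.0], [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5)
```

Unlike the MC update, this can be applied after every transition, without waiting for the episode to finish.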
![Page 29: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/29.jpg)
TD Learning with value function approximation
![Page 30: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/30.jpg)
Think-pair-share
Why is this called “semi-gradient”?
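Here’s the update rule we’re using:
$$\Delta\mathbf{w} = \alpha\,\big(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})$$
Is this really the gradient?
Loss function:
$$J(\mathbf{w}) = \tfrac{1}{2}\big(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})\big)^2$$
What is the gradient actually?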
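One way to work through the question, sketched under the assumption that the loss is half the squared TD error for a single transition (the slide's own formula is not recoverable from this text). Write $\delta_t$ for the TD error:

```latex
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\,\hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}),
\qquad J(\mathbf{w}) = \tfrac{1}{2}\,\delta_t^{2} \\
\nabla_{\mathbf{w}} J(\mathbf{w})
  &= \delta_t\,\big(\gamma\,\nabla_{\mathbf{w}} \hat v(S_{t+1}, \mathbf{w})
     - \nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})\big) \\
-\alpha\,\nabla_{\mathbf{w}} J(\mathbf{w})
  &= \alpha\,\delta_t\,\big(\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})
     - \gamma\,\nabla_{\mathbf{w}} \hat v(S_{t+1}, \mathbf{w})\big)
  && \text{(full gradient step)} \\
\Delta\mathbf{w}
  &= \alpha\,\delta_t\,\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})
  && \text{(semi-gradient step)}
\end{aligned}
```

The semi-gradient step keeps only the part of the gradient that comes from the estimate $\hat v(S_t, \mathbf{w})$ and treats the target as a constant, dropping the $\gamma\,\nabla_{\mathbf{w}} \hat v(S_{t+1}, \mathbf{w})$ term.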
![Page 31: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/31.jpg)
Semi-gradient TD(0) ex: 1000-state random walk
Converges to biased value estimate
![Page 32: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/32.jpg)
Convergence results summary
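1. Gradient-MC converges for both linear and non-linear fn approx
2. Gradient-MC converges to optimal value estimates
– converges to values that min MSE (a local min in the non-linear case)
3. Semi-gradient-TD(0) converges for linear fn approx
4. Semi-gradient-TD(0) converges to a biased estimate
– converges to a point, $\mathbf{w}_{TD}$, that does not minimize MSE
– but we have: $$\mathrm{MSE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\,\min_{\mathbf{w}} \mathrm{MSE}(\mathbf{w})$$
where the left-hand side is evaluated at the fixed point for semi-gradient TD and the right-hand side at the point that min MSE.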
![Page 33: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/33.jpg)
TD Learning with value function approximation
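For linear function approximation, semi-gradient TD(0) converges to a biased estimate of the weights, $\mathbf{w}_{TD}$, such that:
$$\mathrm{MSE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\,\min_{\mathbf{w}} \mathrm{MSE}(\mathbf{w})$$
Here $\mathbf{w}_{TD}$ is the fixed point for semi-gradient TD, and the minimum on the right is attained at the point that min MSE.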
![Page 34: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/34.jpg)
Think-pair-share
Write the semi-gradient weight update equation for the special case of linear function approximation.
How would you update this algorithm for Q-learning?
![Page 35: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/35.jpg)
Linear Sarsa with Coarse Coding in Mountain Car
![Page 36: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/36.jpg)
Linear Sarsa with Coarse Coding in Mountain Car
![Page 37: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/37.jpg)
Least Squares Policy Iteration (LSPI)
Recall that for linear function approximation, J(w) is quadratic in the weights:
$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \mathbf{w}^\top \mathbf{x}(S))^2\big]$$
We can solve for the w that minimizes J(w) directly.
First, let’s think about this in the context of batch policy evaluation.
![Page 38: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/38.jpg)
Policy evaluation
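Given:
– a dataset $D = \{(S_1, G_1), \dots, (S_T, G_T)\}$ generated using policy $\pi$
Find w that min:
$$J(\mathbf{w}) = \sum_{t=1}^{T} \big(G_t - \mathbf{w}^\top \mathbf{x}(S_t)\big)^2$$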
![Page 39: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/39.jpg)
Question
HOW?
![Page 40: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/40.jpg)
Think-pair-share
Given: a dataset
Find w that min:
where a, b, w are scalars.
What if b is a vector?
![Page 42: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/42.jpg)
Policy evaluation
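Given:
– a dataset $D = \{(S_1, G_1), \dots, (S_T, G_T)\}$ generated using policy $\pi$
Find w that min:
$$J(\mathbf{w}) = \sum_{t=1}^{T} \big(G_t - \mathbf{w}^\top \mathbf{x}(S_t)\big)^2$$
1. Set derivative to zero:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2 \sum_{t=1}^{T} \mathbf{x}(S_t)\,\big(G_t - \mathbf{x}(S_t)^\top \mathbf{w}\big) = 0$$
2. Solve for w:
$$\mathbf{w} = \Big(\sum_{t=1}^{T} \mathbf{x}(S_t)\,\mathbf{x}(S_t)^\top\Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t)\,G_t$$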
![Page 44: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/44.jpg)
LSMC policy evaluation
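1. collect a bunch of experience under policy $\pi$
2. calculate weights using:
$$\mathbf{w} = \Big(\sum_{t=1}^{T} \mathbf{x}(S_t)\,\mathbf{x}(S_t)^\top\Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t)\,G_t$$
How do we ensure this matrix is well conditioned?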
![Page 45: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/45.jpg)
Question
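1. collect a bunch of experience under policy $\pi$
2. calculate weights using:
$$\mathbf{w} = \Big(\sum_{t=1}^{T} \mathbf{x}(S_t)\,\mathbf{x}(S_t)^\top + \lambda \mathbf{I}\Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t)\,G_t$$
What effect does the $\lambda \mathbf{I}$ term have?
What cost function is being minimized now?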
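A small sketch of regularized LSMC for the special case of one-hot (state aggregation) features. This is a simplification chosen for this example: with one-hot features the matrix being inverted is diagonal, holding visit counts plus the regularizer, so no general linear solver is needed. The names `reg` and `group_of` are illustrative.

```python
def lsmc_one_hot(samples, num_groups, group_of, reg=1.0):
    """Regularized least-squares MC weights for one-hot features.

    samples: list of (state, return) pairs collected under the policy.
    The sum of x x^T is diagonal with per-group visit counts, so the
    solution is sum-of-returns / (count + reg) for each group.
    """
    counts = [0] * num_groups
    return_sums = [0.0] * num_groups
    for state, G in samples:
        g = group_of(state)
        counts[g] += 1
        return_sums[g] += G
    return [return_sums[g] / (counts[g] + reg) for g in range(num_groups)]


# Group 1 is never visited; without the regularizer its count would be 0.
weights = lsmc_one_hot([(0, 1.0), (0, 3.0)], num_groups=2, group_of=lambda s: s)
```

In this diagonal case the regularizer keeps the division well defined for features that never appear in the data, at the price of shrinking every weight towards zero.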
![Page 46: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/46.jpg)
LSMC policy iteration
1. Take an action according to the current policy,
2. Add experience to buffer:
3. Calculate new LS weights using:
4. Go to step 1
![Page 47: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/47.jpg)
Is there a TD version of this?
1. Take an action according to the current policy,
2. Add experience to buffer:
3. Calculate new LS weights using:
4. Go to step 1
(The target used in step 3 is the MC target.)
![Page 50: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/50.jpg)
LSTD policy evaluation
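In TD learning, the target is:
$$R_{t+1} + \gamma\,\hat v(S_{t+1}, \mathbf{w}) = R_{t+1} + \gamma\,\mathbf{x}(S_{t+1})^\top \mathbf{w}$$
Substituting into the gradient of J(w):
$$0 = \sum_{t} \mathbf{x}(S_t)\,\big(R_{t+1} + \gamma\,\mathbf{x}(S_{t+1})^\top \mathbf{w} - \mathbf{x}(S_t)^\top \mathbf{w}\big)$$
Solving for w (and adding a regularization term):
$$\mathbf{w} = \Big(\sum_{t} \mathbf{x}(S_t)\,\big(\mathbf{x}(S_t) - \gamma\,\mathbf{x}(S_{t+1})\big)^\top + \lambda \mathbf{I}\Big)^{-1} \sum_{t} \mathbf{x}(S_t)\,R_{t+1}$$
Notice this is slightly different from what was used for LSMC.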
![Page 51: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/51.jpg)
LSTD policy evaluation
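1. collect a bunch of experience under policy $\pi$
2. calculate weights using:
$$\mathbf{w} = \Big(\sum_{t} \mathbf{x}(S_t)\,\big(\mathbf{x}(S_t) - \gamma\,\mathbf{x}(S_{t+1})\big)^\top + \lambda \mathbf{I}\Big)^{-1} \sum_{t} \mathbf{x}(S_t)\,R_{t+1}$$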
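The two steps can be sketched in plain Python as below. The two-state chain, the one-hot features, and the small Gaussian-elimination helper are all assumptions made to keep the example self-contained; a numerical library would normally do the solve.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lstd(transitions, features, n, gamma=0.9, reg=0.0):
    """LSTD weights from (state, reward, next_state) transitions.

    next_state is None for a terminal transition (features taken as zero).
    """
    A = [[reg if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s, r, s_next in transitions:
        x = features(s)
        x_next = features(s_next) if s_next is not None else [0.0] * n
        for i in range(n):
            b[i] += x[i] * r
            for j in range(n):
                A[i][j] += x[i] * (x[j] - gamma * x_next[j])
    return solve(A, b)


def one_hot(state):
    return [1.0 if i == state else 0.0 for i in range(2)]


# Hypothetical chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
w = lstd([(0, 0.0, 1), (1, 1.0, None)], one_hot, n=2, gamma=0.9)
```

With tabular features and no regularization the result matches the true values of this chain: 1 for state 1, and 0.9 for state 0.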
![Page 52: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/52.jpg)
LSTDQ
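Approximate Q function as:
$$\hat q(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$$
Now, the update is:
$$\mathbf{w} = \Big(\sum_{t} \mathbf{x}(S_t, A_t)\,\big(\mathbf{x}(S_t, A_t) - \gamma\,\mathbf{x}(S_{t+1}, \pi(S_{t+1}))\big)^\top\Big)^{-1} \sum_{t} \mathbf{x}(S_t, A_t)\,R_{t+1}$$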
![Page 53: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/53.jpg)
LSPI-TD
Policy improvement
Guaranteed to converge to near-optimal (linear fn approx)
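A compact sketch of the LSPI loop: evaluate the current policy with LSTDQ, then improve it greedily, and repeat until the policy stops changing. The two-state MDP, the one-hot features over (state, action) pairs, and the helper names are hypothetical choices for this example, and no regularization is used.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lstdq(samples, policy, features, n, gamma):
    """Policy evaluation: fit q_hat(s, a) = w . x(s, a) for the given policy."""
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s, a, r, s_next in samples:
        x = features(s, a)
        x_next = features(s_next, policy[s_next])
        for i in range(n):
            b[i] += x[i] * r
            for j in range(n):
                A[i][j] += x[i] * (x[j] - gamma * x_next[j])
    return solve(A, b)


def lspi(samples, features, n, num_states, num_actions, gamma=0.5, max_iters=20):
    """Alternate LSTDQ evaluation and greedy improvement on a fixed sample set."""
    policy = [0] * num_states
    w = [0.0] * n
    for _ in range(max_iters):
        w = lstdq(samples, policy, features, n, gamma)

        def q(s, a):
            return sum(w_i * x_i for w_i, x_i in zip(w, features(s, a)))

        # Policy improvement: act greedily with respect to q_hat.
        new_policy = [max(range(num_actions), key=lambda a: q(s, a))
                      for s in range(num_states)]
        if new_policy == policy:
            break
        policy = new_policy
    return policy, w


def one_hot(s, a):
    x = [0.0] * 4
    x[2 * s + a] = 1.0
    return x


# (s, a, r, s'): action 0 stays, action 1 moves to the other state;
# the reward is 1 whenever the agent lands in state 1.
samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
policy, w = lspi(samples, one_hot, n=4, num_states=2, num_actions=2)
```

On this toy problem the loop settles after one improvement step on the policy "move from state 0, stay in state 1". Reusing the same samples across iterations is what distinguishes this batch scheme from the online methods earlier in the deck.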
![Page 54: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/54.jpg)
Chain Walk Example
![Page 55: A Closer Look at Function Approximation · Think-pair-share – what are the pros/cons of rectangular tiles like this? – what are the pros/cons to evenly spacing the tilings vs](https://reader034.vdocuments.site/reader034/viewer/2022042110/5e8add64f0f52571cb31d6d3/html5/thumbnails/55.jpg)
LSPI in Chain Walk: Action-Value Function
Notice that the policy is optimal after iteration 4