• Short reading for Thursday
• Job talk at 1:30pm in ETRL 101
• Kuka robotics – http://www.kuka-timoboll.com/en/home/
Unified View
On-line, Tabular TD(λ)
• Flappy Bird: state space?
– http://sarvagyavaish.github.io/FlappyBirdRL/
Chapter 9: Generalization and Function Approximation
• How does experience in parts of the state space help us act over the entire state space?
• How can function approximation (supervised learning) merge with RL?
• Function approximator convergence
• “I read it and it mostly makes sense.”
• “There are many methods to do [function approximation], most of which made very little sense as explained.”
• Instead of a lookup table for values of V at time t (Vt), consider some kind of weight vector wt
• E.g., wt could be the weights in a neural network
• Instead of one value (weight) per state, now we update this vector
Insight: Steal from Existing Supervised Learning Methods!
• Training = {X, Y}
• Error = target output – actual output
TD Backups as Training Examples
• Recall the TD(0) backup:
• As a training example:– Input = Features of st
– Target output = rt+1 + γV(st+1)
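The TD(0) training pair above can be sketched in code. This is a minimal illustration, assuming a linear value function V(s) = w·φ(s); the feature vectors, reward, weights, and discount value here are all made-up example values, not from the slides.

```python
import numpy as np

# Sketch: forming a TD(0) training example with a linear value
# function V(s) = w . phi(s). All concrete values (features,
# reward, gamma) are illustrative.

gamma = 0.9
w = np.zeros(4)                              # weight vector, one entry per feature

def V(phi):
    """Linear value estimate for a feature vector phi."""
    return w @ phi

phi_s = np.array([1.0, 0.0, 0.5, 0.0])       # features of s_t
phi_s_next = np.array([0.0, 1.0, 0.0, 0.5])  # features of s_{t+1}
r_next = 1.0                                 # reward r_{t+1}

# Training pair: input = features of s_t,
#                target = r_{t+1} + gamma * V(s_{t+1})
x = phi_s
y = r_next + gamma * V(phi_s_next)
```

The pair (x, y) can then be handed to any supervised learner, which is exactly the "steal from supervised learning" insight.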
What FA methods can we use?
• In principle, anything!
– Neural networks
– Decision trees
– Multivariate regression
– Support Vector Machines
– Gaussian Processes
– Etc.
• But, we normally want to
– Learn while interacting
– Handle nonstationarity
– Not take “too long” or use “too much” memory
– Etc.
Perceptron
• Binary, linear classifier: Rosenblatt, 1957
• The eventual failure of the perceptron to do “everything” shifted the field of AI towards symbolic representations
• Sum = w1x1 + w2x2 + … + wnxn
• Output is +1 if sum > 0, -1 otherwise
• wj = wj + (target – output) xj
• Also, can use x0 = 1 and w0 is therefore a bias
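The update rule on the slide (wj = wj + (target – output) xj, with x0 = 1 as the bias input) is enough to learn a linearly separable function. A minimal sketch, where the task (logical AND) and the number of training passes are illustrative choices:

```python
import numpy as np

# Minimal Rosenblatt-style perceptron. Bias handled by prepending
# x0 = 1 so w[0] acts as the bias weight.
# Update rule from the slide: w_j += (target - output) * x_j.

def predict(w, x):
    # Sum = w1*x1 + ... + wn*xn (plus bias via x0); threshold at 0.
    return 1 if w @ x > 0 else -1

# Inputs with x0 = 1 prepended; targets encode logical AND as +/-1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(3)
for _ in range(10):                   # a few passes over the data
    for xi, ti in zip(X, y):
        w += (ti - predict(w, xi)) * xi

print([predict(w, xi) for xi in X])   # learns AND: [-1, -1, -1, 1]
```

Because AND is linearly separable, the weights stop changing after a few passes; XOR, famously, would never converge here.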
Perceptron
• Consider a perceptron with 3 weights: x, y, and bias
Spatial-based Perceptron Weights
Neural Networks
• How do we get around only linear solutions?
Neural Networks
• A multi-layer network of linear perceptrons is still linear.
• Non-linear (differentiable) units
• Logistic or tanh function
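The point that stacked linear layers stay linear, while a differentiable nonlinearity like tanh does not, can be checked numerically. A small sketch; the matrix shapes and random values are arbitrary:

```python
import numpy as np

# Why multi-layer *linear* nets are still linear, and how a
# differentiable nonlinearity (tanh here) changes that.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first layer weights
W2 = rng.normal(size=(1, 3))   # second layer weights
x = np.array([0.5, -1.0])

# Two linear layers compose into the single linear map (W2 @ W1):
linear_two_layer = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
assert np.allclose(linear_two_layer, collapsed)

# With tanh between the layers, no single matrix reproduces the map:
nonlinear = W2 @ np.tanh(W1 @ x)
```

This is why adding layers without nonlinearities buys nothing: the whole network could be replaced by one perceptron layer.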
Intermission
• UAS in France
– http://www.uasvision.com/2014/02/18/18-year-old-in-nancy-prosecuted-for-uas-video-on-youtube/
Gradient Descent
• w = (w1, w2, …, wn)T
• Assume Vt(s) is a sufficiently smooth differentiable function of w, for all states s in S
• Also, assume that training examples are of the form: (features of st, Vπ(st))
• Goal: minimize error on the observed samples
• Update: wt+1 = wt + α[Vπ(st) − Vt(st)] ∇wt Vt(st)
• ∇wt Vt(st) is the vector of partial derivatives of Vt(st) with respect to the weights (the gradient)
• Let J(w) be any function of the weight space
• The gradient at any point wt in this space is: ∇J(wt) = (∂J(wt)/∂w1, ∂J(wt)/∂w2, …, ∂J(wt)/∂wn)T
• Then, to iteratively move down the gradient: wt+1 = wt − α∇J(wt)
• Why still do this iteratively? If you could just eliminate the error on each sample in one step, why could that be a bad idea?
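The iterative descent step can be sketched on a toy objective. Here J(w) = ||w − w*||², whose gradient is 2(w − w*); the target point and step size α are illustrative:

```python
import numpy as np

# Iterative gradient descent on a simple quadratic
# J(w) = ||w - w*||^2, with gradient 2(w - w*).

w_star = np.array([1.0, -2.0])   # illustrative minimizer

def grad_J(w):
    return 2.0 * (w - w_star)

w = np.zeros(2)
alpha = 0.1
for _ in range(100):
    w = w - alpha * grad_J(w)    # small step down the gradient

# w approaches w* gradually rather than jumping there in one step.
# With *sample* gradients (as in RL), the small steps are what
# average out the noise in each individual estimate.
```

Eliminating the error on one sample in a single step would overfit that sample; small α steps trade speed for stability under noisy targets.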
• Common goal is to minimize mean-squared error (MSE) over distribution d:
MSE(wt) = Σs∈S d(s)[Vπ(s) − Vt(s)]²
• Why does this make any sense?
• d is the distribution of states receiving backups
• on- or off-policy distribution
• (Dmitry) Motivation for choosing MSE:
– MSE is the squared 2-norm of the error.
– 1) The square of the norm is a sum of squares, and its derivative is a linear function, which is good.
– 2) QR decomposition can be used to get a nice solution for linear approximation problems
• Find x which minimizes the 2-norm of (Ax − b)
• Other norms don't have such simple solutions
Gradient Descent
• Each sample gradient is an unbiased estimate of the true gradient
• This will converge to a local minimum of the MSE if α decreases “appropriately” over time
• Unfortunately, we don’t actually have vπ(s)
• Instead, we just have an estimate of the target, Vt
• If Vt is an unbiased estimate of vπ(st), then we’ll converge to a local minimum (again with the α caveat)
• TD(λ) update: θt+1 = θt + αδtet
• δt is our normal TD error: δt = rt+1 + γVt(st+1) − Vt(st)
• et is the vector of eligibility traces: et = γλet−1 + ∇θVt(st)
• θ is a weight vector
• Note that TD(λ) targets are biased
• But… we do it anyway
Linear Methods
• Why are these a particularly important type of function approximation?
• Parameter vector θt
• Column vector of features φs for every state
• (same number of components as θ)
Linear Methods
• Gradient is simple: ∇θVt(s) = φs
• Error surface for MSE is simple (single minimum)
• Coarse coding
• Generalization based on features activating
Size Matters
Tile coding
Tile coding, view #2
• Consider a game of soccer
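The tile-coding idea can be sketched in one dimension: several overlapping tilings, each offset slightly, turn a continuous state into a few active binary features. The tile width, number of tilings, and offset scheme below are illustrative choices, not a standard implementation:

```python
import numpy as np

# 1-D tile coding sketch: n_tilings offset copies of a uniform
# grid over [0, 1); each tiling contributes exactly one active tile.

n_tilings, n_tiles, tile_width = 4, 10, 0.1

def active_tiles(x):
    """Return one active tile index per tiling for state x in [0, 1)."""
    indices = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings    # shift each tiling slightly
        idx = int((x + offset) / tile_width) % n_tiles
        indices.append(t * n_tiles + idx)      # unique index per tiling
    return indices

# Nearby states share most of their active tiles -> generalization.
a, b = active_tiles(0.42), active_tiles(0.43)
```

Because nearby states activate mostly the same tiles, an update at one state generalizes to its neighbors, with the tile width controlling how far.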
But, how do you pick the “coarseness”?
Adaptive tile coding
IFSA
Irregular tilings
Radial Basis Functions
• Instead of binary, have degrees of activation
• Can combine with tile coding!
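The graded activation of radial basis functions can be sketched directly: each feature's response falls off smoothly with distance from its center, instead of switching on and off like a tile. The centers and width σ below are illustrative:

```python
import numpy as np

# RBF features: Gaussian activation per center, graded rather
# than binary.

centers = np.linspace(0.0, 1.0, 5)   # 5 RBF centers spread over [0, 1]
sigma = 0.2                          # shared width parameter

def rbf_features(x):
    """One graded activation per center for scalar state x."""
    return np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))

phi = rbf_features(0.5)
# The center at 0.5 is fully active; its neighbors are partially active.
```

The smooth falloff is what lets RBFs produce smoothly varying value estimates, at the cost of having to choose centers and widths.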
• Kanerva Coding: choose “prototype states” and consider distance from prototype states
• Now, updates depend on number of features, not number of dimensions
• Instance-based methods
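Kanerva coding can be sketched with binary states and Hamming distance: a feature is active when the state falls within a threshold distance of its prototype. The prototypes, bit-vector encoding, and threshold are illustrative assumptions:

```python
import numpy as np

# Kanerva coding sketch: features = "am I close to prototype k?".
# Update cost scales with the number of prototypes, not the
# dimensionality of the state.

rng = np.random.default_rng(1)
prototypes = rng.integers(0, 2, size=(6, 8))   # 6 prototypes over 8-bit states

def kanerva_features(state, threshold=3):
    """Binary feature per prototype: 1 if Hamming distance <= threshold."""
    dists = np.sum(prototypes != state, axis=1)  # Hamming distances
    return (dists <= threshold).astype(int)

state = rng.integers(0, 2, size=8)
phi = kanerva_features(state)
# phi has 6 components -- one per prototype -- regardless of how
# many dimensions the state has.
```

This is the point on the slide: the feature count depends on the number of prototypes, so the method sidesteps the curse of dimensionality in the state representation.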
Fitted R-Max [Jong and Stone, 2007]
• Instance-based RL method [Ormoneit & Sen, 2002]
• Handles continuous state spaces
• Weights recorded transitions by distances
• Plans over discrete, abstract MDP
• Example: 2 state variables, 1 action
[Figure: recorded transitions plotted over state variables x and y]
Mountain-Car Task
3D Mountain Car
• X: position and acceleration
• Y: position and acceleration
• Control with FA
• Bootstrapping
Efficiency in ML / AI
1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)
• Todd’s work with decision trees
• Course feedback