• Short reading for Thursday
• Job talk at 1:30pm in ETRL 101
• Kuka robotics – http://www.kuka-timoboll.com/en/home/
Unified View
On-line, Tabular TD(λ)
• Flappy Bird: state space?
– http://sarvagyavaish.github.io/FlappyBirdRL/
Chapter 9: Generalization and Function Approximation
• How does experience in parts of the state space help us act over the entire state space?
• How can function approximation (supervised learning) merge with RL?
• Function approximator convergence
• “I read it and it mostly makes sense.”
• “There are many methods to do [function approximation], most of which made very little sense as explained.”
• Instead of a lookup table for values of V at time t (Vt), consider some kind of weight vector wt
• E.g., wt could be the weights in a neural network
• Instead of one value (weight) per state, now we update this vector
Insight: Steal from Existing Supervised Learning Methods!
• Training = {X, Y}
• Error = target output – actual output
TD Backups as Training Examples
• Recall the TD(0) backup:
• As a training example:– Input = Features of st
– Target output = rt+1 + γV(st+1)
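The TD(0) training pair above can be sketched in code. This is a minimal illustration, assuming a linear value function V(s) = w·φ(s); the feature vectors, reward, weights, and discount value here are all made-up example values, not from the slides.

```python
import numpy as np

# Sketch: forming a TD(0) training example with a linear value
# function V(s) = w . phi(s). All concrete values (features,
# reward, gamma) are illustrative.

gamma = 0.9
w = np.zeros(4)                              # weight vector, one entry per feature

def V(phi):
    """Linear value estimate for a feature vector phi."""
    return w @ phi

phi_s = np.array([1.0, 0.0, 0.5, 0.0])       # features of s_t
phi_s_next = np.array([0.0, 1.0, 0.0, 0.5])  # features of s_{t+1}
r_next = 1.0                                 # reward r_{t+1}

# Training pair: input = features of s_t,
#                target = r_{t+1} + gamma * V(s_{t+1})
x = phi_s
y = r_next + gamma * V(phi_s_next)
```

The pair (x, y) can then be handed to any supervised learner, which is exactly the "steal from supervised learning" insight.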
What FA methods can we use?
• In principle, anything!
– Neural networks
– Decision trees
– Multivariate regression
– Support Vector Machines
– Gaussian Processes
– Etc.
• But, we normally want to
– Learn while interacting
– Handle nonstationarity
– Not take “too long” or use “too much” memory
– Etc.
Perceptron
• Binary, linear classifier: Rosenblatt, 1957
• The eventual failure of the perceptron to do “everything” shifted the field of AI towards symbolic representations
• Sum = w1x1 + w2x2 + … + wnxn
• Output is +1 if sum > 0, -1 otherwise
• wj = wj + (target – output) xj
• Also, can use x0 = 1 and w0 is therefore a bias
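The update rule on the slide (wj = wj + (target – output) xj, with x0 = 1 as the bias input) is enough to learn a linearly separable function. A minimal sketch, where the task (logical AND) and the number of training passes are illustrative choices:

```python
import numpy as np

# Minimal Rosenblatt-style perceptron. Bias handled by prepending
# x0 = 1 so w[0] acts as the bias weight.
# Update rule from the slide: w_j += (target - output) * x_j.

def predict(w, x):
    # Sum = w1*x1 + ... + wn*xn (plus bias via x0); threshold at 0.
    return 1 if w @ x > 0 else -1

# Inputs with x0 = 1 prepended; targets encode logical AND as +/-1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(3)
for _ in range(10):                   # a few passes over the data
    for xi, ti in zip(X, y):
        w += (ti - predict(w, xi)) * xi

print([predict(w, xi) for xi in X])   # learns AND: [-1, -1, -1, 1]
```

Because AND is linearly separable, the weights stop changing after a few passes; XOR, famously, would never converge here.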
Perceptron
• Consider a perceptron with 3 weights: x, y, and bias
Spatial-based Perceptron Weights
Neural Networks
• How do we get around only linear solutions?
Neural Networks
• A multi-layer network of linear perceptrons is still linear.
• Non-linear (differentiable) units
• Logistic or tanh function
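The point that stacked linear layers stay linear, while a differentiable nonlinearity like tanh does not, can be checked numerically. A small sketch; the matrix shapes and random values are arbitrary:

```python
import numpy as np

# Why multi-layer *linear* nets are still linear, and how a
# differentiable nonlinearity (tanh here) changes that.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first layer weights
W2 = rng.normal(size=(1, 3))   # second layer weights
x = np.array([0.5, -1.0])

# Two linear layers compose into the single linear map (W2 @ W1):
linear_two_layer = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
assert np.allclose(linear_two_layer, collapsed)

# With tanh between the layers, no single matrix reproduces the map:
nonlinear = W2 @ np.tanh(W1 @ x)
```

This is why adding layers without nonlinearities buys nothing: the whole network could be replaced by one perceptron layer.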
Intermission
• UAS in France
– http://www.uasvision.com/2014/02/18/18-year-old-in-nancy-prosecuted-for-uas-video-on-youtube/
Gradient Descent
• w = (w1, w2, …, wn)T
• Assume Vt(s) is a sufficiently smooth differentiable function of w, for all states s in S
• Also, assume that training examples are of the form: (features of st, Vπ(st))
• Goal: minimize error on the observed samples
• Update: wt+1 = wt + α[Vπ(st) − Vt(st)] ∇wt Vt(st)
• ∇wt Vt(st) is the vector of partial derivatives of Vt(st) with respect to the weights (the gradient)
• Let J(w) be any function of the weight space
• The gradient at any point wt in this space is: ∇J(wt) = (∂J(wt)/∂w1, ∂J(wt)/∂w2, …, ∂J(wt)/∂wn)T
• Then, to iteratively move down the gradient: wt+1 = wt − α∇J(wt)
• Why still do this iteratively? If you could just eliminate the error on each sample in one step, why could that be a bad idea?
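The iterative descent step can be sketched on a toy objective. Here J(w) = ||w − w*||², whose gradient is 2(w − w*); the target point and step size α are illustrative:

```python
import numpy as np

# Iterative gradient descent on a simple quadratic
# J(w) = ||w - w*||^2, with gradient 2(w - w*).

w_star = np.array([1.0, -2.0])   # illustrative minimizer

def grad_J(w):
    return 2.0 * (w - w_star)

w = np.zeros(2)
alpha = 0.1
for _ in range(100):
    w = w - alpha * grad_J(w)    # small step down the gradient

# w approaches w* gradually rather than jumping there in one step.
# With *sample* gradients (as in RL), the small steps are what
# average out the noise in each individual estimate.
```

Eliminating the error on one sample in a single step would overfit that sample; small α steps trade speed for stability under noisy targets.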
• Common goal is to minimize mean-squared error (MSE) over distribution d:
MSE(wt) = Σs∈S d(s)[Vπ(s) − Vt(s)]²
• Why does this make any sense?
• d is the distribution of states receiving backups
• on- or off-policy distribution
• (Dmitry) Motivation for choosing MSE:
– MSE is the squared 2-norm of the error.
– 1) The square of the norm is a sum of squares, and its derivative is a linear function, which is good.
– 2) QR decomposition can be used to get a nice solution for linear approximation problems
• Find x which minimizes the 2-norm of (Ax − b)
• Other norms don't have such simple solutions
Gradient Descent
• Each sample gradient is an unbiased estimate of the true gradient
• This will converge to a local minimum of the MSE if α decreases “appropriately” over time
• Unfortunately, we don’t actually have vπ(s)
• Instead, we just have an estimate of the target, Vt
• If Vt is an unbiased estimate of vπ(st), then we’ll converge to a local minimum (again with the α caveat)
• TD(λ) update: θt+1 = θt + αδtet
• δt is our normal TD error: δt = rt+1 + γVt(st+1) − Vt(st)
• et is the vector of eligibility traces: et = γλet−1 + ∇θVt(st)
• θ is a weight vector
• Note that TD(λ) targets are biased
• But… we do it anyway
Linear Methods
• Why are these a particularly important type of function approximation?
• Parameter vector θt
• Column vector of features φs for every state
• (same number of components as θ)
Linear Methods
• Gradient is simple: ∇θVt(s) = φs
• Error surface for MSE is simple (single minimum)
• Coarse coding
• Generalization based on features activating
Size Matters
Tile coding
Tile coding, view #2
• Consider a game of soccer
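The tile-coding idea can be sketched in one dimension: several overlapping tilings, each offset slightly, turn a continuous state into a few active binary features. The tile width, number of tilings, and offset scheme below are illustrative choices, not a standard implementation:

```python
import numpy as np

# 1-D tile coding sketch: n_tilings offset copies of a uniform
# grid over [0, 1); each tiling contributes exactly one active tile.

n_tilings, n_tiles, tile_width = 4, 10, 0.1

def active_tiles(x):
    """Return one active tile index per tiling for state x in [0, 1)."""
    indices = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings    # shift each tiling slightly
        idx = int((x + offset) / tile_width) % n_tiles
        indices.append(t * n_tiles + idx)      # unique index per tiling
    return indices

# Nearby states share most of their active tiles -> generalization.
a, b = active_tiles(0.42), active_tiles(0.43)
```

Because nearby states activate mostly the same tiles, an update at one state generalizes to its neighbors, with the tile width controlling how far.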
But, how do you pick the “coarseness”?
Adaptive tile coding
IFSA
Irregular tilings
Radial Basis Functions
• Instead of binary, have degrees of activation
• Can combine with tile coding!
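The graded activation of radial basis functions can be sketched directly: each feature's response falls off smoothly with distance from its center, instead of switching on and off like a tile. The centers and width σ below are illustrative:

```python
import numpy as np

# RBF features: Gaussian activation per center, graded rather
# than binary.

centers = np.linspace(0.0, 1.0, 5)   # 5 RBF centers spread over [0, 1]
sigma = 0.2                          # shared width parameter

def rbf_features(x):
    """One graded activation per center for scalar state x."""
    return np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))

phi = rbf_features(0.5)
# The center at 0.5 is fully active; its neighbors are partially active.
```

The smooth falloff is what lets RBFs produce smoothly varying value estimates, at the cost of having to choose centers and widths.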
• Kanerva Coding: choose “prototype states” and consider distance from prototype states
• Now, updates depend on number of features, not number of dimensions
• Instance-based methods
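Kanerva coding can be sketched with binary states and Hamming distance: a feature is active when the state falls within a threshold distance of its prototype. The prototypes, bit-vector encoding, and threshold are illustrative assumptions:

```python
import numpy as np

# Kanerva coding sketch: features = "am I close to prototype k?".
# Update cost scales with the number of prototypes, not the
# dimensionality of the state.

rng = np.random.default_rng(1)
prototypes = rng.integers(0, 2, size=(6, 8))   # 6 prototypes over 8-bit states

def kanerva_features(state, threshold=3):
    """Binary feature per prototype: 1 if Hamming distance <= threshold."""
    dists = np.sum(prototypes != state, axis=1)  # Hamming distances
    return (dists <= threshold).astype(int)

state = rng.integers(0, 2, size=8)
phi = kanerva_features(state)
# phi has 6 components -- one per prototype -- regardless of how
# many dimensions the state has.
```

This is the point on the slide: the feature count depends on the number of prototypes, so the method sidesteps the curse of dimensionality in the state representation.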
Fitted R-Max [Jong and Stone, 2007]
• Instance-based RL method [Ormoneit & Sen, 2002]
• Handles continuous state spaces
• Weights recorded transitions by distances
• Plans over discrete, abstract MDP
• Example: 2 state variables, 1 action
[Figure: recorded transitions plotted over state variables x and y]
Mountain-Car Task
3D Mountain Car
• X: position and acceleration
• Y: position and acceleration
• Control with FA
• Bootstrapping
Efficiency in ML / AI
1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)
• Todd’s work with decision trees
• Course feedback