Off-Policy Temporal-Difference Learning with Function Approximation
Doina Precup, McGill University
Rich Sutton, Sanjoy Dasgupta
AT&T Labs
Off-policy Learning
Learning about a way of behaving without behaving (exactly) that way
Target policy must be part of (covered by) the source (behavior) policy:
π(s,a) > 0  ⇒  π′(s,a) > 0
E.g., Q-learning learns about the greedy policy while following something more exploratory
Learning about many macro-action policies at once
We need off-policy learning!
RL Algorithm Space
[Diagram: three overlapping properties - TD, Linear FA, Off-policy. Pairwise combinations are stable, e.g. Linear TD(λ) (Tsitsiklis & Van Roy 1997; Tadic 2000) and Q-learning / options; all three together: Boom! (Baird 1995; Gordon 1995; NDP 1996)]
We need all 3
But we can only get 2 at a time
Baird’s Counterexample
[Diagram: a six-state Markov chain; five upper states with approximate values θ_0 + 2θ_1, ..., θ_0 + 2θ_5, one lower state with approximate value 2θ_0 + θ_6; transitions labeled ε, 1 − ε, and 100%]
[Figure: parameter values θ_k(i) vs. iterations k (log scale, broken at ±1) - the parameters diverge]
Markov chain (no actions)
All states updated equally often, synchronously
Exact solution exists: θ = 0
Initial θ = (1, 1, 1, 1, 1, 10, 1)^T
Importance Sampling
Re-weighting samples according to their “importance,” correcting for a difference in sampling distribution
For example, any episode
e = s_0 a_0 r_1 s_1 a_1 r_2 ... s_{T−1} a_{T−1} r_T s_T
has probability
Pr(e|π) = p_0(s_0, a_0) p(s_0, s_1, a_0) ∏_{k=1}^{T−1} π(s_k, a_k) p(s_k, s_{k+1}, a_k)
under π, so its importance is
Pr(e|π) / Pr(e|π′) = ∏_{k=1}^{T−1} π(s_k, a_k) / π′(s_k, a_k)
This corrects for oversampling under π′
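To make the episode weight concrete, here is a minimal Python sketch (the function and policy names are illustrative, not from the paper): it forms the product of per-step probability ratios along a trajectory generated by the behavior policy.

```python
def episode_importance_weight(state_actions, target_pi, behavior_pi):
    """Importance weight of one episode: prod_k pi(s_k, a_k) / pi'(s_k, a_k).

    state_actions : list of (state, action) pairs actually taken under the
                    behavior policy pi'
    target_pi(s, a), behavior_pi(s, a) : probability each policy assigns to a in s
    """
    weight = 1.0
    for s, a in state_actions:
        assert behavior_pi(s, a) > 0.0, "behavior policy must cover the target policy"
        weight *= target_pi(s, a) / behavior_pi(s, a)
    return weight
```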
Naïve Importance Sampling Alg
Update_t = (∏_{k=1}^{T−1} π(s_k, a_k) / π′(s_k, a_k)) × (regular linear TD(λ) update_t)
Converts off-policy to on-policy
On-policy convergence theorem then applies (Tsitsiklis & Van Roy, 1997; Tadic, 2000)
But variance is high, convergence is very slow
We can do better!
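A minimal λ = 0 sketch of this naive scheme, under the same illustrative names as above: the per-step TD updates are accumulated over the episode and then scaled by the full-episode importance weight.

```python
import numpy as np

def naive_is_td0_episode(theta, episode, phi, target_pi, behavior_pi,
                         alpha=0.1, gamma=1.0):
    """Naive importance-sampled linear TD(0) over one episode.

    episode : [(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T)]
    phi(s, a) : feature vector for the state-action pair (s, a)
    """
    total_update = np.zeros_like(theta)
    weight = 1.0                                   # prod_{k=1}^{T-1} pi/pi'
    for t, (s, a, r) in enumerate(episode):
        if t >= 1:                                 # product starts at k = 1
            weight *= target_pi(s, a) / behavior_pi(s, a)
        q_sa = theta @ phi(s, a)
        if t + 1 < len(episode):
            s1, a1, _ = episode[t + 1]
            q_next = theta @ phi(s1, a1)
        else:
            q_next = 0.0                           # terminal state
        total_update += alpha * (r + gamma * q_next - q_sa) * phi(s, a)
    return theta + weight * total_update           # every update scaled by the episode weight
```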
Linear Function Approximation
Approximate the action-value function
Q^π(s,a) = E_π[ r_{t+1} + γ r_{t+2} + ... + γ^{T−t−1} r_T | s_t = s, a_t = a ]
as a linear form
Q^π(s,a) ≈ θ^T φ_{s,a} = Σ_i θ(i) φ_{s,a}(i)
where φ_{s,a} is a feature vector representing (s,a) and θ is the modifiable parameter vector
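As a tiny illustration (made-up numbers, not from the paper), the linear form is just a dot product between the parameter vector and the feature vector:

```python
import numpy as np

theta = np.array([0.5, -0.2, 1.0])     # modifiable parameter vector
phi_sa = np.array([1.0, 0.0, 2.0])     # feature vector representing (s, a)

q_sa = float(theta @ phi_sa)           # Q(s,a) ~ theta^T phi_{s,a} = 2.5
print(q_sa)
```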
Per-Decision Importance-Sampled TD(λ)
Updating after each episode:
θ ← θ + Σ_{t=0}^{T−1} Δθ_t
Linear TD(0):
Δθ_t = α [ r_{t+1} + γ θ^T φ_{s_{t+1} a_{t+1}} − θ^T φ_{s_t a_t} ] φ_{s_t a_t}
The new algorithm, per-decision importance-sampled TD(0):
Δθ_t = α [ r_{t+1} + γ θ^T φ_{s_{t+1} a_{t+1}} − θ^T φ_{s_t a_t} ] φ_{s_t a_t} ∏_{k=1}^{t} π(s_k, a_k) / π′(s_k, a_k)
(see paper for general λ)
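Continuing the illustrative Python sketch from above, the only change from the naive algorithm is that each step's update is scaled by the ratio product only up to that step:

```python
import numpy as np

def per_decision_is_td0_episode(theta, episode, phi, target_pi, behavior_pi,
                                alpha=0.1, gamma=1.0):
    """Per-decision importance-sampled linear TD(0) over one episode."""
    total_update = np.zeros_like(theta)
    rho_product = 1.0                              # prod_{k=1}^{t}, empty at t = 0
    for t, (s, a, r) in enumerate(episode):
        if t >= 1:
            rho_product *= target_pi(s, a) / behavior_pi(s, a)
        q_sa = theta @ phi(s, a)
        if t + 1 < len(episode):
            s1, a1, _ = episode[t + 1]
            q_next = theta @ phi(s1, a1)
        else:
            q_next = 0.0                           # terminal state
        total_update += alpha * (r + gamma * q_next - q_sa) * phi(s, a) * rho_product
    return theta + total_update                    # applied once at episode end
```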
Main Result
E_{π′}[ Δθ^new | s_0, a_0 ] = E_π[ Δθ^TD(λ) | s_0, a_0 ]   ∀ s_0, a_0
The expected total change over an episode for the new algorithm, following the behavior policy π′, equals the expected total change for conventional TD(λ) following the target policy π.
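Why a single per-decision ratio suffices - a one-step version of the standard importance-sampling argument, not the paper's full proof: for any function f of the state-action pair, and assuming π(s,a) > 0 ⇒ π′(s,a) > 0,
E_{π′}[ (π(s,a)/π′(s,a)) f(s,a) | s ] = Σ_a π′(s,a) (π(s,a)/π′(s,a)) f(s,a) = Σ_a π(s,a) f(s,a) = E_π[ f(s,a) | s ]
Applying this correction at every decision along the episode yields the equality of expected total updates above.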
Convergence Theorem (based on Tsitsiklis & Van Roy 1997)
Under the usual assumptions, and one annoying assumption:
the new algorithm converges to the same θ as on-policy TD(λ)
The annoying assumption:
var_{π′}[ ∏_{k=1}^{T−1} π(s_k, a_k) / π′(s_k, a_k) ] < ∞
e.g., satisfied when episode length is bounded
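A toy Monte Carlo illustration of this condition (entirely made-up two-action policies, not from the paper): for a bounded episode length T the variance of the importance-weight product is finite, though it grows quickly with T.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_weight_variance(T, n_episodes=100_000):
    """Estimate var_{pi'}[ prod_k pi(a_k)/pi'(a_k) ] for a toy task with two
    actions: target probabilities (0.8, 0.2), uniform behavior policy."""
    pi = np.array([0.8, 0.2])
    pi_behavior = np.array([0.5, 0.5])
    actions = rng.integers(0, 2, size=(n_episodes, T))      # behavior: uniform
    weights = np.prod(pi[actions] / pi_behavior[actions], axis=1)
    return weights.var()

for T in (1, 5, 10, 20):
    print(T, is_weight_variance(T))   # finite for every bounded T, but growing
```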
The variance assumption is restrictive
But it can often be satisfied with "artificial" terminations
• Consider a modified MDP with bounded episode length
  – We have data for this MDP
  – Our result assures good convergence for it
  – Its solution can be made close to the solution of the original problem
  – By choosing the episode bound long relative to γ or the mixing time
• Consider the application to macro-actions
  – Here it is the macro-action that terminates
  – Termination is artificial; the real process is unaffected
  – Yet all results apply directly to learning about macro-actions
  – We can choose macro-action termination to satisfy the variance condition
Empirical Illustration
Agent always starts at S; terminal states are marked G; actions are deterministic
Behavior policy chooses up/down with probabilities 0.4/0.1
Target policy chooses up/down with probabilities 0.1/0.4
If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one
Trajectories of Two Components of θ
λ = 0.9, α decreased over time
θ appears to converge as advertised
[Figure: the components of θ for the features (leftmost, down) and (rightmost, down) over episodes × 100,000, approaching their asymptotic values µ*_{leftmost,down} and µ*_{rightmost,down}]
Comparison of Naïve and Per-Decision IS Algorithms
λ = 0.9, constant α
[Figure: root mean squared error of Naive IS and Per-Decision IS as a function of log_2 α, after 100,000 episodes, averaged over 50 runs]
Precup, Sutton & Dasgupta, 2001
Can Weighted IS help the variance?
Return to the tabular case and consider two estimators:
Q_n^IS(s,a) = (1/n) Σ_{i=1}^{n} R_i w_i
where R_i is the ith return following (s,a) (at time t) and
w_i = ∏_{k=t+1}^{T−1} π(s_k, a_k) / π′(s_k, a_k)
is the IS correction product. This estimator converges with finite variance iff the w_i have finite variance.
Q_n^ISW(s,a) = Σ_{i=1}^{n} R_i w_i / Σ_{i=1}^{n} w_i
converges with finite variance even if the w_i have infinite variance.
Can this be extended to the FA case?
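A minimal tabular sketch of the two estimators (names are illustrative; the returns and their IS correction products are assumed to have been collected as described above):

```python
import numpy as np

def ordinary_is_estimate(returns, weights):
    """Q_n^IS(s,a) = (1/n) * sum_i R_i w_i.
    Unbiased, but has finite variance only if the weights w_i do."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return float(np.mean(returns * weights))

def weighted_is_estimate(returns, weights):
    """Q_n^ISW(s,a) = sum_i R_i w_i / sum_i w_i.
    Converges with finite variance even if the w_i have infinite variance."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return float(np.sum(returns * weights) / np.sum(weights))
```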
Restarting within an Episode
• We can consider episodes to start at any time
• This alters the weighting of states
  – But we still converge
  – And to near the best answer (for the new weighting)
Incremental Implementation
At the start of each episode:
c_0 = g_0
e_0 = c_0 φ_0
On each step s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} (0 ≤ t < T):
ρ_{t+1} = π(s_{t+1}, a_{t+1}) / π′(s_{t+1}, a_{t+1})
δ_t = r_{t+1} + γ ρ_{t+1} θ^T φ_{t+1} − θ^T φ_t
Δθ_t = α δ_t e_t
c_{t+1} = ρ_{t+1} c_t + g_{t+1}
e_{t+1} = γ λ ρ_{t+1} e_t + c_{t+1} φ_{t+1}
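A direct transcription of these updates into the illustrative Python sketch used earlier. The restart weights g_t default here to 1 at the true start of the episode and 0 afterwards; that default, and the choice to apply the accumulated update at episode end, are assumptions of the sketch, not the paper's prescription.

```python
import numpy as np

def per_decision_is_td_lambda_episode(theta, episode, phi, target_pi, behavior_pi,
                                      alpha=0.1, gamma=1.0, lam=0.9, g=None):
    """Incremental per-decision importance-sampled linear TD(lambda), one episode.

    episode : [(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T)]
    g[t]    : restart weights (default: g_0 = 1, all later g_t = 0)
    """
    T = len(episode)
    if g is None:
        g = [1.0] + [0.0] * T
    s, a, _ = episode[0]
    c = g[0]                                   # c_0 = g_0
    e = c * phi(s, a)                          # e_0 = c_0 * phi_0
    total_update = np.zeros_like(theta)
    for t in range(T):
        s, a, r = episode[t]
        if t + 1 < T:                          # a next state-action exists
            s1, a1, _ = episode[t + 1]
            rho = target_pi(s1, a1) / behavior_pi(s1, a1)    # rho_{t+1}
            q_next = theta @ phi(s1, a1)
        else:                                  # terminal transition
            rho, q_next = 1.0, 0.0
        delta = r + gamma * rho * q_next - theta @ phi(s, a)  # delta_t
        total_update += alpha * delta * e                     # Delta-theta_t
        if t + 1 < T:
            c = rho * c + g[t + 1]                            # c_{t+1}
            e = gamma * lam * rho * e + c * phi(s1, a1)       # e_{t+1}
    return theta + total_update               # applied at episode end, as in earlier slides
```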
Conclusion
• First off-policy TD methods with linear FA
  – Certainly not the last
  – Somewhat greater efficiencies are undoubtedly possible
• But the problem is so important
• Can't we do better?
  – Is there no other approach?
  – Something other than importance sampling?
• I can’t think of a credible alternative approach
• Perhaps experimentation in a nontrivial domain would suggest other possibilities...