Reinforcement Learning in MDPs
by Least-Squares Policy Iteration
Presented by Lihan He
Machine Learning Reading Group
Duke University
09/16/2005
Outline
MDP and Q-function
Value function approximation
LSPI: Least-Squares Policy Iteration
Proto-value Functions
RPI: Representation Policy Iteration
Markov Decision Process (MDP)
An MDP is a model M = <S, A, T, R> with
a set of environment states S,
a set of actions A,
a transition function T: S × A × S → [0,1], T(s,a,s') = P(s'|s,a),
a reward function R: S × A → ℝ.
A policy is a function π: S → A.
Value function (expected cumulative reward) V^π: S → ℝ.
Satisfying the Bellman equation: V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')
Markov Decision Process (MDP)
An example of grid world environment
[Figure: optimal policy and value function for the grid world; the terminal state has reward +1 and the remaining cell values range from 0.5 to 0.9.]
State-action value function Q
The state-action value function Q^π(s,a) of a policy π is defined over all possible combinations of states and actions, and gives the expected, discounted, total reward when taking action a in state s and following policy π thereafter.
Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Q^π(s', π(s'))
Given a policy π, each state-action pair (s,a) has a value Q^π(s,a).
In matrix form, the Bellman equation above becomes
Q^π = R + γ P Π_π Q^π
where Q^π and R are vectors of size |S||A|, P is the stochastic matrix of size (|S||A| × |S|) with P((s,a), s') = P(s'|s,a), and Π_π is the (|S| × |S||A|) matrix that selects the action π(s') in each next state s'.
How does policy iteration work?
[Diagram: policy iteration loop — policy evaluation computes the value function Q^π from the model (Q^π = R + γ P Π_π Q^π); policy improvement derives a new policy π from Q^π, and the two steps alternate.]
For model-free reinforcement learning, we do not have the model P.
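To make the loop above concrete, here is a minimal tabular policy-iteration sketch (not from the slides) for the case where the model P is known; it evaluates the current policy by solving the linear Bellman system and then improves the policy greedily. The shapes and names are illustrative assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Tabular policy iteration with a known model.

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: expected rewards, shape (n_states, n_actions)
    """
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi, then form Q.
        P_pi = P[np.arange(n_states), pi]         # (n_states, n_states) under pi
        R_pi = R[np.arange(n_states), pi]         # (n_states,) rewards under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        Q = R + gamma * P @ V                     # (n_states, n_actions)
        # Policy improvement: act greedily with respect to Q.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):            # policy stable -> converged
            break
        pi = new_pi
    return pi, Q
```

LSPI keeps this alternation but replaces the exact, model-based evaluation step with a least-squares approximation learned from samples, which is what the rest of the talk develops.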
Value function approximation
Let Q̂(s,a; w) be an approximation to Q^π(s,a) with free parameters w:

Q̂(s,a; w) = Σ_{j=1}^{k} φ_j(s,a) w_j = φ(s,a)^T w

where φ(s,a) = ( φ_1(s,a), ..., φ_k(s,a) )^T.

i.e., Q values are approximated by a linear parametric combination of k basis functions φ_j(s,a), j = 1, ..., k. The basis functions are fixed, but arbitrarily selected (non-linear) functions of s and a.

Note that Q^π is a vector of size |S||A|. If k = |S||A| and the bases are independent, we can find w such that Q̂ = Q^π. In general k << |S||A|, and we use a linear combination of only a few bases to approximate the value function Q^π.

Solve for w → evaluate Q̂ and get the updated policy π(s) = argmax_a Q̂(s,a).
Value function approximation
Examples of basis functions:
Polynomials, e.g. with two actions and degree 2:
φ(s,a) = ( I(a=a1), I(a=a1)·s, I(a=a1)·s², I(a=a2), I(a=a2)·s, I(a=a2)·s² )^T
Radial basis functions (RBF)
Proto-value functions
Other manually designed bases based on specific problems
Use indicator function I(a=ai) to decouple actions so that each action gets its own parameters.
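For concreteness, a minimal sketch (not from the slides) of this indicator-function decoupling with a polynomial basis, assuming a one-dimensional state and a small discrete action set; the function name, degree, and block layout are illustrative choices.

```python
import numpy as np

def poly_basis(s, a, n_actions=2, degree=2):
    """phi(s, a): polynomial state features, one block per action.

    The indicator I(a == a_i) zeroes out every block except the one
    belonging to the chosen action, so each action gets its own parameters.
    Feature length k = n_actions * (degree + 1).
    """
    state_feats = np.array([s ** d for d in range(degree + 1)])   # [1, s, s^2, ...]
    phi = np.zeros(n_actions * (degree + 1))
    phi[a * (degree + 1):(a + 1) * (degree + 1)] = state_feats    # block for action a
    return phi

# For example, poly_basis(0.5, 1) -> [0, 0, 0, 1, 0.5, 0.25]
```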
Value function approximation
Least-Squares Fixed-Point Approximation
Use Q^π = R + γ P Π_π Q^π, and recall that Q̂ = Φ w is the projection of Q^π onto the space spanned by the basis functions. Let

Φ = ( φ(s_1,a_1)^T ; ... ; φ(s,a)^T ; ... ; φ(s_|S|,a_|A|)^T )

be the (|S||A| × k) matrix whose rows are the feature vectors, so that Q̂ = Φ w. By the projection theorem, we finally get

Φ^T (Φ − γ P Π_π Φ) w^π = Φ^T R
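When the model is known, this fixed-point equation can be solved directly. The sketch below is an illustration, not from the slides; it assumes the rows of Φ are ordered so that pair (s,a) sits at row s·|A| + a and that π is given as an array of action indices.

```python
import numpy as np

def lsq_fixed_point_weights(Phi, P, R, pi, gamma=0.9):
    """Solve Phi^T (Phi - gamma * P * Pi_pi * Phi) w = Phi^T R for a known model.

    Phi: (|S||A|, k) basis matrix, rows phi(s,a)^T in (s,a) order
    P:   (|S||A|, |S|) transition matrix, P[(s,a), s'] = P(s'|s,a)
    R:   (|S||A|,) expected rewards
    pi:  (|S|,) action index chosen by the policy in each state
    """
    n_sa, k = Phi.shape
    n_s = P.shape[1]
    n_a = n_sa // n_s
    # Pi_pi picks row (s', pi(s')) of Phi for every next state s'.
    next_rows = np.arange(n_s) * n_a + pi          # row index of (s', pi(s'))
    PPhi = P @ Phi[next_rows]                      # (|S||A|, k): P * Pi_pi * Phi
    A = Phi.T @ (Phi - gamma * PPhi)               # (k, k)
    b = Phi.T @ R                                  # (k,)
    return np.linalg.solve(A, b)
```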
Least-Squares Policy Iteration
Solving for the parameters w is equivalent to solving the linear system A w = b, where

A = Φ^T (Φ − γ P Π_π Φ)
b = Φ^T R

Written out element-wise:

A = Σ_s Σ_a Σ_{s'} P(s'|s,a) [ φ(s,a) ( φ(s,a) − γ φ(s', π(s')) )^T ]
b = Σ_s Σ_a Σ_{s'} P(s'|s,a) [ φ(s,a) R(s,a,s') ]

A is the sum of many matrices φ(s,a) ( φ(s,a) − γ φ(s', π(s')) )^T, and b is the sum of many vectors φ(s,a) R(s,a,s'), each weighted by the transition probability P(s'|s,a).
Least-Squares Policy Iteration
If we have samples drawn from the underlying transition probabilities,

D = { (s_i, a_i, r_i, s'_i) }_{i=1}^{L}

then A and b can be learned (in batch) as

Ã = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) ( φ(s_i,a_i) − γ φ(s'_i, π(s'_i)) )^T
b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i

or incrementally (real-time), one sample at a time:

Ã^(t+1) = Ã^(t) + φ(s_t,a_t) ( φ(s_t,a_t) − γ φ(s'_t, π(s'_t)) )^T
b̃^(t+1) = b̃^(t) + φ(s_t,a_t) r_t
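A minimal sketch of the batch estimates Ã and b̃ (and the solve for w̃), assuming a feature map `phi(s, a)` such as the polynomial sketch above and a list of (s, a, r, s') samples; the small ridge term added before solving is an implementation convenience, not part of the slides.

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.9, ridge=1e-6):
    """Estimate A~ and b~ from samples and solve A~ w = b~.

    samples: list of (s, a, r, s_next) tuples
    phi:     feature map phi(s, a) -> vector of length k
    policy:  current policy, policy(s) -> action
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))        # phi(s', pi(s'))
        A += np.outer(f, f - gamma * f_next)        # rank-one update
        b += f * r
    A /= len(samples)
    b /= len(samples)
    w = np.linalg.solve(A + ridge * np.eye(k), b)   # w~ = A~^{-1} b~
    return w
```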
Least-Squares Policy Iteration
Input: D, k, φ, γ, ε, π_0 (given by w_0)

π' ← π_0        % π(s) = argmax_a φ(s,a)^T w

repeat
    π ← π'
    Ã = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) ( φ(s_i,a_i) − γ φ(s'_i, π(s'_i)) )^T        % value function update;
    b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i                                          % could use the real-time updates Ã^(t+1), b̃^(t+1) instead
    w̃ = Ã^{-1} b̃
    π'(s) ← argmax_a φ(s,a)^T w̃        % policy update
until ||w − w'|| < ε
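Putting the pieces together, a sketch of the LSPI loop under the same assumptions; it reuses the hypothetical `lstdq` helper from the previous sketch and a finite action set, and the greedy action selection implements π(s) = argmax_a φ(s,a)^T w.

```python
import numpy as np

def lspi(samples, phi, k, actions, gamma=0.9, eps=1e-4, max_iters=50):
    """Least-Squares Policy Iteration over a fixed batch of samples."""
    w = np.zeros(k)                                    # w0: initial weights (zero)
    greedy = lambda s, w: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(max_iters):
        policy = lambda s, w=w: greedy(s, w)           # pi(s) = argmax_a phi(s,a)^T w
        w_new = lstdq(samples, phi, policy, k, gamma)  # value-function update
        if np.linalg.norm(w_new - w) < eps:            # until ||w - w'|| < eps
            return w_new
        w = w_new
    return w
```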
Proto-Value Function
How do we choose the basis functions φ(s,a) in the LSPI algorithm?
No need to design the bases manually
The data tell us what the corresponding proto-value functions are
Generated from the topology of the underlying state space
No need to estimate the underlying state transition probabilities
Capture the intrinsic smoothness constraints that true value functions have
Proto-value functions are good bases for value function approximation.
1. Graph representation of the underlying state transition:
Proto-Value Function
2. Adjacency matrix A:
[Figure: an example graph over states s1-s11 and the corresponding 11 × 11 adjacency matrix, with a 1 wherever two states are connected by an edge.]
Proto-Value Function
3. Combinatorial Laplacian L: L = T − A,
where T is the diagonal matrix whose entries are the row sums of the adjacency matrix A.
4. Proto-value functions:
Eigenvectors of the combinatorial Laplacian L, i.e. solutions of L f = λ f.
Each eigenvector provides one basis function φ_j(s); combined with the indicator function for action a, it gives φ_j(s,a).
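A minimal sketch of steps 2-4, assuming a symmetric 0/1 adjacency matrix built from observed transitions; the eigenvectors of L with the smallest eigenvalues are the low-order proto-value functions used as state bases.

```python
import numpy as np

def proto_value_functions(adjacency, k):
    """Compute k proto-value functions from a state adjacency matrix.

    adjacency: symmetric (|S|, |S|) 0/1 matrix of observed transitions
    Returns an (|S|, k) matrix whose columns are the eigenvectors of the
    combinatorial Laplacian L = T - A with the k smallest eigenvalues.
    """
    A = np.asarray(adjacency, dtype=float)
    T = np.diag(A.sum(axis=1))            # diagonal matrix of row sums (degrees)
    L = T - A                             # combinatorial Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh sorts eigenvalues ascending
    return eigvecs[:, :k]                 # low-order eigenvectors = smoothest bases
```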
Example of proto-value functions:
[Figure: the grid world and its adjacency matrix (full 1260 × 1260 view, plus a zoom-in on rows/columns 450-500).]
Grid world: 1260 states
Proto-value functions: Low-order eigenvectors as basis functions
Optimal value function
Value function approximation using 10 proto-value functions as bases
Representation Policy Iteration (offline)
Input: D, k, γ, ε, π_0 (given by w_0)

1. Construct basis functions:
(a) Use the sample set D to learn a graph that encodes the underlying state-space topology.
(b) Compute the k lowest-order eigenvectors of the combinatorial Laplacian on the graph. The basis functions φ(s,a) are produced by combining the k proto-value functions with the indicator function of action a (see the sketch after this algorithm).

2. π' ← π_0        % π(s) = argmax_a φ(s,a)^T w

3. repeat
    π ← π'
    Ã = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) ( φ(s_i,a_i) − γ φ(s'_i, π(s'_i)) )^T        % value function update
    b̃ = (1/L) Σ_{i=1}^{L} φ(s_i,a_i) r_i
    w̃ = Ã^{-1} b̃
    π'(s) ← argmax_a φ(s,a)^T w̃        % policy update
until ||w − w'|| < ε
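Step 1(b) can be sketched by reusing the hypothetical `proto_value_functions` helper from the earlier sketch; the per-action block layout is an illustrative assumption.

```python
import numpy as np

def make_pvf_basis(adjacency, k, n_actions):
    """Return phi(s, a) built from k proto-value functions.

    Each state-only eigenvector phi_j(s) is replicated once per action and
    masked by the indicator I(a == a_i), so the feature length is k * n_actions.
    """
    pvf = proto_value_functions(adjacency, k)       # (|S|, k) eigenvector matrix
    def phi(s, a):
        feats = np.zeros(k * n_actions)
        feats[a * k:(a + 1) * k] = pvf[s]           # block for the chosen action
        return feats
    return phi
```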
Representation Policy Iteration (online)
Input: D_0, k, γ, ε, π_0 (given by w_0)

1. Initialization:
Using the offline algorithm, based on D_0, k, γ, ε, π_0, learn policy π^(0) and obtain Ã^(0), b̃^(0).

2. π' ← π^(0)        % π(s) = argmax_a φ(s,a)^T w

3. repeat
    (a) π^(t) ← π'
    (b) Execute π^(t) to get new data D^(t) = {s_t, a_t, r_t, s'_t}.
    (c) If the new data D^(t) change the topology of the graph G, compute a new set of basis functions.
    (d) Ã^(t+1) = Ã^(t) + φ(s_t,a_t) ( φ(s_t,a_t) − γ φ(s'_t, π(s'_t)) )^T        % value function update
        b̃^(t+1) = b̃^(t) + φ(s_t,a_t) r_t
        w̃ = Ã^{-1} b̃
    (e) π'(s) ← argmax_a φ(s,a)^T w̃        % policy update
until ||w − w'|| < ε
Example: Chain MDP; reward +1 at states 10 and 41, 0 otherwise.
Optimal policy: go right in states 1-9 and 26-41; go left in states 11-25 and 42-50.
Example: Chain MDP
Value function and its approximation at each iteration
20 basis functions used
Example: Chain MDP
Steps to convergence
Policy L1 error with respect to optimal policy
Performance comparison
References:
M. Lagoudakis & R. Parr, Least-Squares Policy Iteration. Journal of Machine Learning Research 4 (2003), 1107-1149.
-- Gives the LSPI algorithm for reinforcement learning
S. Mahadevan, Proto-Value Functions: Developmental Reinforcement Learning. Proceedings of ICML 2005.
-- How to build basis functions for the LSPI algorithm
C. Kwok & D. Fox, Reinforcement Learning for Sensing Strategies. Proceedings of IROS 2004.
-- An application of LSPI