MIT 6.231 Notes Short
TRANSCRIPT
APPROXIMATE DYNAMIC PROGRAMMING
A SERIES OF LECTURES GIVEN AT
CEA - CADARACHE
FRANCE
SUMMER 2012
DIMITRI P. BERTSEKAS
These lecture slides are based on the book:
Dynamic Programming and Optimal Control: Approximate Dynamic Programming, Athena Scientific, 2012; see
http://www.athenasc.com/dpbook.html
For a fuller set of slides, see
http://web.mit.edu/dimitrib/www/publ.html
APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE I
Our subject:
Large-scale DP based on approximations and in part on simulation.
This has been a research area of great interest for the last 20 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming)
Emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory
Deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization)
A vast range of applications in control theory, operations research, artificial intelligence, and beyond ...
The subject is broad with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms ... less on theory and modeling
APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE II
Our aim:
A state-of-the-art account of some of the major topics at a graduate level
Show how the use of approximation and simulation can address the dual curses of DP: dimensionality and modeling
Our 7-lecture plan:
Two lectures on exact DP with emphasis on infinite horizon problems and issues of large-scale computational methods
One lecture on general issues of approximation and simulation for large-scale problems
One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
One lecture on aggregation methods
One lecture on stochastic approximation, Q-learning, and other methods
One lecture on Monte Carlo methods for solving general problems involving linear equations and inequalities
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 1
LECTURE OUTLINE
Introduction to DP and approximate DP
Finite horizon problems
The DP algorithm for finite horizon problems
Infinite horizon problems
Basic theory of discounted infinite horizon problems
BASIC STRUCTURE OF STOCHASTIC DP
Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, . . . , N - 1
k: Discrete time
x_k: State; summarizes past information that is relevant for future optimization
u_k: Control; decision to be selected at time k from a given set
w_k: Random parameter (also called disturbance or noise depending on the context)
N: Horizon or number of times control is applied
Cost function that is additive over time
E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
Alternative system description: P(x_{k+1} | x_k, u_k)
x_{k+1} = w_k with P(w_k | x_k, u_k) = P(x_{k+1} | x_k, u_k)
INVENTORY CONTROL EXAMPLE
[Figure: inventory system with stock x_k at period k, order u_k, demand w_k; stock at period k+1 is x_{k+1} = x_k + u_k - w_k; cost of period k is c u_k + r(x_k + u_k - w_k)]
Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
Cost function that is additive over time
E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E { Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
ADDITIONAL ASSUMPTIONS
Optimization over policies: These are rules/functions
u_k = μ_k(x_k),   k = 0, . . . , N - 1
that map states to controls (closed-loop optimization, use of feedback)
The set of values that the control u_k can take depends at most on x_k and not on prior x's or u's
The probability distribution of w_k does not depend on past values w_{k-1}, . . . , w_0, but may depend on x_k and u_k
Otherwise past values of w or x would be useful for future optimization
GENERIC FINITE-HORIZON PROBLEM
System x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, . . . , N - 1
Control constraints u_k ∈ U_k(x_k)
Probability distribution P_k(· | x_k, u_k) of w_k
Policies π = {μ_0, . . . , μ_{N-1}}, where μ_k maps states x_k into controls u_k = μ_k(x_k) and is such that μ_k(x_k) ∈ U_k(x_k) for all x_k
Expected cost of π starting at x_0 is
J_π(x_0) = E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) }
Optimal cost function
J*(x_0) = min_π J_π(x_0)
Optimal policy π* satisfies
J_{π*}(x_0) = J*(x_0)
When produced by DP, π* is independent of x_0.
PRINCIPLE OF OPTIMALITY
Let π* = {μ*_0, μ*_1, . . . , μ*_{N-1}} be an optimal policy
Consider the tail subproblem whereby we are at x_k at time k and wish to minimize the "cost-to-go" from time k to time N
E { g_N(x_N) + Σ_{ℓ=k}^{N-1} g_ℓ(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
and the tail policy {μ*_k, μ*_{k+1}, . . . , μ*_{N-1}}
Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
DP solves ALL the tail subproblems
At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length
DP ALGORITHM
J_k(x_k): opt. cost of tail problem starting at x_k
Start with
J_N(x_N) = g_N(x_N),
and go backwards using
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k} { g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) },   k = 0, 1, . . . , N - 1
i.e., to solve the tail subproblem at time k, minimize
[kth-stage cost + opt. cost of the tail problem starting from the next state at time k+1]
Then J_0(x_0), generated at the last step, is equal to the optimal cost J*(x_0). Also, the policy
π* = {μ*_0, . . . , μ*_{N-1}}
where μ*_k(x_k) minimizes in the right side above for each x_k and k, is optimal
Proof by induction
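As an illustration of the backward recursion above, here is a minimal Python sketch for a small finite model. The problem data (states, controls, disturbance distribution, f, g, g_N) are hypothetical placeholders, not part of the lecture notes.

```python
def finite_horizon_dp(states, controls, dist, f, g, g_N, N):
    """Backward DP recursion: returns cost tables J_k and a policy mu_k.
    `dist(k, x, u)` yields (w, prob) pairs; `f`, `g`, `g_N` are stage model functions."""
    J = {N: {x: g_N(x) for x in states}}
    mu = {}
    for k in reversed(range(N)):
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) } over the disturbance w
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in dist(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```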
PRACTICAL DIFFICULTIES OF DP
The curse of dimensionality
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
Intractability of imperfect state information problems
The curse of modeling
Sometimes a simulator of the system is easier to construct than a model
There may be real-time solution constraints
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled, creating a need for on-line replanning
All of the above are motivations for approximation and simulation
COST-TO-GO FUNCTION APPROXIMATION
Use a policy computed from the DP equation where the optimal cost-to-go function J_{k+1} is replaced by an approximation J̃_{k+1}.
Apply μ̄_k(x_k), which attains the minimum in
min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) }
Some approaches:
(a) Problem Approximation: Use J̃_k derived from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use as J̃_k a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this)
This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as J̃_k the cost of some suboptimal policy, which is calculated either analytically or by simulation
ROLLOUT ALGORITHMS
At each k and state x_k, use the control μ̄_k(x_k) that minimizes in
min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) },
where J̃_{k+1} is the cost-to-go of some heuristic policy (called the base policy).
Cost improvement property: The rollout algorithm achieves no worse (and usually much better) cost than the base policy starting from the same state.
Main difficulty: Calculating J̃_{k+1}(x) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case.
Connection w/ Model Predictive Control (MPC)
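A minimal sketch of the rollout idea above for a generic finite-horizon model, with the base policy's cost-to-go estimated by Monte Carlo simulation. All names (dist_sample, base_policy, etc.) are hypothetical placeholders.

```python
def rollout_control(x, k, controls, dist_sample, f, g, base_policy, N, n_sims=100):
    """One-step lookahead using a Monte Carlo cost-to-go of a base (heuristic) policy."""
    def base_cost_to_go(xs0, t0):
        # Average cost of following the base policy from (xs0, t0) up to the horizon N.
        total = 0.0
        for _ in range(n_sims):
            xs, cost = xs0, 0.0
            for t in range(t0, N):
                u = base_policy(xs, t)
                w = dist_sample(t, xs, u)
                cost += g(t, xs, u, w)
                xs = f(t, xs, u, w)
            total += cost
        return total / n_sims

    best_u, best_val = None, float("inf")
    for u in controls(x):
        # Estimate E{ g_k(x,u,w) + J~_{k+1}(f_k(x,u,w)) } by sampling w.
        val = 0.0
        for _ in range(n_sims):
            w = dist_sample(k, x, u)
            val += g(k, x, u, w) + base_cost_to_go(f(k, x, u, w), k + 1)
        val /= n_sims
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```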
INFINITE HORIZON PROBLEMS
Same as the basic problem, but:
The number of stages is infinite.
The system is stationary.
Total cost problems: Minimize
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
Discounted problems (α < 1)
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g is bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
Boundedness of g guarantees that all costs are well-defined and bounded: |J_π(x)| ≤ M / (1 - α)
All spaces are arbitrary - only boundedness of g is important (there are math fine points, e.g. measurability, but they don't matter in practice)
Important special case: All underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP
All algorithms essentially work with an MDP that approximates the original problem
SHORTHAND NOTATION FOR DP MAPPING
For any function J of x
(TJ)(x) = min_{u ∈ U(x)} E_w { g(x, u, w) + α J(f(x, u, w)) },   ∀ x
TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost function αJ.
T operates on bounded functions of x to produce other bounded functions of x
For any stationary policy μ
(T_μ J)(x) = E_w { g(x, μ(x), w) + α J(f(x, μ(x), w)) },   ∀ x
The critical structure of the problem is captured in T and T_μ
The entire theory of discounted problems can be developed in shorthand using T and T_μ
This is true for many other DP problems
FINITE-HORIZON COST EXPRESSIONS
Consider an N-stage policy π_0^N = {μ_0, μ_1, . . . , μ_{N-1}} with a terminal cost J:
J_{π_0^N}(x_0) = E { α^N J(x_N) + Σ_{ℓ=0}^{N-1} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
             = E { g(x_0, μ_0(x_0), w_0) + α J_{π_1^N}(x_1) }
             = (T_{μ_0} J_{π_1^N})(x_0)
where π_1^N = {μ_1, μ_2, . . . , μ_{N-1}}
By induction we have
J_{π_0^N}(x) = (T_{μ_0} T_{μ_1} · · · T_{μ_{N-1}} J)(x),   ∀ x
For a stationary policy μ the N-stage cost function (with terminal cost J) is
J_{π_0^N} = T_μ^N J
where T_μ^N is the N-fold composition of T_μ
Similarly the optimal N-stage cost function (with terminal cost J) is T^N J
T^N J = T(T^{N-1} J) is just the DP algorithm
SHORTHAND THEORY A SUMMARY
Infinite horizon cost function expressions [with J_0(x) ≡ 0]
J_π(x) = lim_{N→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_N} J_0)(x),   J_μ(x) = lim_{N→∞} (T_μ^N J_0)(x)
Bellman's equation: J* = TJ*,   J_μ = T_μ J_μ
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k,
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Find μ^{k+1} such that
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
TWO KEY PROPERTIES
Monotonicity property: For any J and J' such that J(x) ≤ J'(x) for all x, and any μ
(TJ)(x) ≤ (TJ')(x),   ∀ x,
(T_μ J)(x) ≤ (T_μ J')(x),   ∀ x.
Constant Shift property: For any J, any scalar r, and any μ
(T(J + re))(x) = (TJ)(x) + αr,   ∀ x,
(T_μ(J + re))(x) = (T_μ J)(x) + αr,   ∀ x,
where e is the unit function [e(x) ≡ 1].
Monotonicity is present in all DP models (undiscounted, etc)
Constant shift is special to discounted models
Discounted problems have another property of major importance: T and T_μ are contraction mappings (we will show this later)
CONVERGENCE OF VALUE ITERATION
If J_0 ≡ 0,
J*(x) = lim_{k→∞} (T^k J_0)(x),   for all x
Proof: For any initial state x_0, and policy π = {μ_0, μ_1, . . .},
J_π(x_0) = E { Σ_{ℓ=0}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
         = E { Σ_{ℓ=0}^{k-1} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) } + E { Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
The tail portion satisfies
| E { Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) } | ≤ α^k M / (1 - α),
where M ≥ |g(x, u, w)|. Take the min over π of both sides. Q.E.D.
BELLMANS EQUATION
The optimal cost function J* satisfies Bellman's Eq., i.e. J* = TJ*.
Proof: For all x and k,
J*(x) - α^k M / (1 - α) ≤ (T^k J_0)(x) ≤ J*(x) + α^k M / (1 - α),
where J_0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using Monotonicity and Constant Shift,
(TJ*)(x) - α^{k+1} M / (1 - α) ≤ (T^{k+1} J_0)(x) ≤ (TJ*)(x) + α^{k+1} M / (1 - α)
Taking the limit as k → ∞ and using the fact
lim_{k→∞} (T^{k+1} J_0)(x) = J*(x)
we obtain J* = TJ*. Q.E.D.
THE CONTRACTION PROPERTY
Contraction property: For any bounded functions J and J', and any μ,
max_x |(TJ)(x) - (TJ')(x)| ≤ α max_x |J(x) - J'(x)|,
max_x |(T_μ J)(x) - (T_μ J')(x)| ≤ α max_x |J(x) - J'(x)|.
Proof: Denote c = max_{x ∈ S} |J(x) - J'(x)|. Then
J(x) - c ≤ J'(x) ≤ J(x) + c,   ∀ x
Apply T to both sides, and use the Monotonicity and Constant Shift properties:
(TJ)(x) - αc ≤ (TJ')(x) ≤ (TJ)(x) + αc,   ∀ x
Hence
|(TJ)(x) - (TJ')(x)| ≤ αc,   ∀ x.  Q.E.D.
NEC. AND SUFFICIENT OPT. CONDITION
A stationary policy μ is optimal if and only if μ(x) attains the minimum in Bellman's equation for each x; i.e.,
TJ* = T_μ J*.
Proof: If TJ* = T_μ J*, then using Bellman's equation (J* = TJ*), we have
J* = T_μ J*,
so by uniqueness of the fixed point of T_μ, we obtain J* = J_μ; i.e., μ is optimal.
Conversely, if the stationary policy μ is optimal, we have J* = J_μ, so
J* = T_μ J*.
Combining this with Bellman's Eq. (J* = TJ*), we obtain TJ* = T_μ J*. Q.E.D.
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
Review of discounted problem theory
Review of shorthand notation
Algorithms for discounted DP
Value iteration
Policy iteration
Optimistic policy iteration
Q-factors and Q-learning
A more abstract view of DP
Extensions of discounted DP
Value and policy iteration
Asynchronous algorithms
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
SHORTHAND THEORY A SUMMARY
Cost function expressions [with J_0(x) ≡ 0]
J_π(x) = lim_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J_0)(x),   J_μ(x) = lim_{k→∞} (T_μ^k J_0)(x)
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ, or
J*(x) = min_{u ∈ U(x)} E_w { g(x, u, w) + α J*(f(x, u, w)) },   ∀ x
J_μ(x) = E_w { g(x, μ(x), w) + α J_μ(f(x, μ(x), w)) },   ∀ x
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(x) ∈ arg min_{u ∈ U(x)} E_w { g(x, u, w) + α J*(f(x, u, w)) },   ∀ x
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
MAJOR PROPERTIES
Monotonicity property: For any functions J and J' on the state space X such that J(x) ≤ J'(x) for all x ∈ X, and any μ
(TJ)(x) ≤ (TJ')(x),   (T_μ J)(x) ≤ (T_μ J')(x),   ∀ x ∈ X
Contraction property: For any bounded functions J and J', and any μ,
max_x |(TJ)(x) - (TJ')(x)| ≤ α max_x |J(x) - J'(x)|,
max_x |(T_μ J)(x) - (T_μ J')(x)| ≤ α max_x |J(x) - J'(x)|.
Compact Contraction Notation:
||TJ - TJ'|| ≤ α ||J - J'||,   ||T_μ J - T_μ J'|| ≤ α ||J - J'||,
where for any bounded function J, we denote by ||J|| the sup-norm
||J|| = max_{x ∈ X} |J(x)|.
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(x) = E_w { g(x, μ^k(x), w) + α J_{μ^k}(f(x, μ^k(x), w)) },   ∀ x
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(x) ∈ arg min_{u ∈ U(x)} E_w { g(x, u, w) + α J_{μ^k}(f(x, u, w)) },   ∀ x
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
For a finite state space, policy evaluation is equivalent to solving a linear system of equations
Dimension of the system is equal to the number of states.
For large problems, exact PI is out of the question (even though it terminates finitely)
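For a small finite-state MDP in the transition-probability notation used later in these notes (p_ij(u), g(i, u, j), discount α), VI and exact PI can be sketched as below; the input arrays P and G are hypothetical, not data from the notes.

```python
import numpy as np

def value_iteration(P, G, alpha, iters=1000, tol=1e-10):
    """VI: P[u] is the n x n transition matrix of control u, G[u][i][j] = g(i,u,j)."""
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(iters):
        # (TJ)(i) = min_u sum_j p_ij(u) ( g(i,u,j) + alpha J(j) )
        TJ = np.min([(P[u] * (G[u] + alpha * J)).sum(axis=1) for u in range(len(P))], axis=0)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ
    return J

def policy_iteration(P, G, alpha):
    """Exact PI: policy evaluation solves an n x n linear system, then improves."""
    n, m = P[0].shape[0], len(P)
    mu = np.zeros(n, dtype=int)
    while True:
        Pmu = np.array([P[mu[i]][i] for i in range(n)])
        gmu = np.array([(P[mu[i]][i] * G[mu[i]][i]).sum() for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)   # J_mu = T_mu J_mu
        Q = np.array([(P[u] * (G[u] + alpha * J)).sum(axis=1) for u in range(m)])
        new_mu = Q.argmin(axis=0)                            # policy improvement
        if np.array_equal(new_mu, mu):
            return J, mu
        mu = new_mu
```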
INTERPRETATION OF VI AND PI
[Figure: one-dimensional geometric interpretation of VI (the iterates TJ_0, T^2 J_0, . . . converge along the 45-degree line to the fixed point of J = TJ) and of PI (policy evaluation J_{μ1} = T_{μ1} J_{μ1}, followed by policy improvement)]
JUSTIFICATION OF POLICY ITERATION
We can show that J_{μ^{k+1}} ≤ J_{μ^k} for all k
Proof: For given k, we have
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k} ≤ T_{μ^k} J_{μ^k} = J_{μ^k}
Using the monotonicity property of DP,
J_{μ^k} ≥ T_{μ^{k+1}} J_{μ^k} ≥ T_{μ^{k+1}}^2 J_{μ^k} ≥ · · · ≥ lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k}
Since
lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k} = J_{μ^{k+1}}
we have J_{μ^k} ≥ J_{μ^{k+1}}.
If J_{μ^k} = J_{μ^{k+1}}, then J_{μ^k} solves Bellman's equation and is therefore equal to J*
So at iteration k either the algorithm generates a strictly improved policy or it finds an optimal policy
For a finite spaces MDP, there are finitely many stationary policies, so the algorithm terminates with an optimal policy
APPROXIMATE PI
Suppose that the policy evaluation is approximate,
||J_k - J_{μ^k}|| ≤ δ,   k = 0, 1, . . .
and policy improvement is approximate,
||T_{μ^{k+1}} J_k - T J_k|| ≤ ε,   k = 0, 1, . . .
where δ and ε are some positive scalars.
Error Bound I: The sequence {μ^k} generated by approximate policy iteration satisfies
limsup_{k→∞} ||J_{μ^k} - J*|| ≤ (ε + 2αδ) / (1 - α)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
Error Bound II: If in addition the sequence {μ^k} terminates at μ̄,
||J_{μ̄} - J*|| ≤ (ε + 2αδ) / (1 - α)
OPTIMISTIC POLICY ITERATION
Optimistic PI (more efficient): This is PI, where policy evaluation is done approximately, with a finite number of VI
So we approximate the policy evaluation
J_μ ≈ T_μ^m J
for some number m ∈ [1, ∞)
Shorthand definition: For some integers m_k
T_{μ^k} J_k = T J_k,   J_{k+1} = T_{μ^k}^{m_k} J_k,   k = 0, 1, . . .
If m_k ≡ 1 it becomes VI
If m_k = ∞ it becomes PI
Can be shown to converge (in an infinite number of iterations)
Q-LEARNING I
We can write Bellman's equation as
J*(x) = min_{u ∈ U(x)} Q*(x, u),   ∀ x,
where Q* is the unique solution of
Q*(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q*(x̄, v) }
with x̄ = f(x, u, w)
Q*(x, u) is called the optimal Q-factor of (x, u)
We can equivalently write the VI method as
J_{k+1}(x) = min_{u ∈ U(x)} Q_{k+1}(x, u),   ∀ x,
where Q_{k+1} is generated by
Q_{k+1}(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q_k(x̄, v) }
with x̄ = f(x, u, w)
Q-LEARNING II
Q-factors are no different than costs
They satisfy a Bellman equation Q = FQ where
(FQ)(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q(x̄, v) }
where x̄ = f(x, u, w)
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs
They require an equal amount of computation ... they just need more storage
Having optimal Q-factors is convenient when implementing an optimal policy on-line by
μ*(x) = arg min_{u ∈ U(x)} Q*(x, u)
Once Q*(x, u) are known, the model [g and E{·}] is not needed. Model-free operation.
Later we will see how stochastic/sampling methods can be used to calculate (approximations of) Q*(x, u) using a simulator of the system (no model needed)
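A minimal sketch of VI on Q-factors (Q_{k+1} = FQ_k) for a small finite MDP, with the greedy policy read off at the end; the model arrays P and G are hypothetical placeholders.

```python
import numpy as np

def q_factor_vi(P, G, alpha, iters=500, tol=1e-10):
    """Value iteration on Q-factors for an n-state MDP.
    P[u]: n x n transition matrix of control u, G[u][i][j] = g(i,u,j)."""
    n, m = P[0].shape[0], len(P)
    Q = np.zeros((n, m))
    for _ in range(iters):
        Jmin = Q.min(axis=1)                       # min_v Q(j, v)
        # (FQ)(i,u) = sum_j p_ij(u) ( g(i,u,j) + alpha * min_v Q(j,v) )
        Qnew = np.stack([(P[u] * (G[u] + alpha * Jmin)).sum(axis=1) for u in range(m)], axis=1)
        if np.max(np.abs(Qnew - Q)) < tol:
            Q = Qnew
            break
        Q = Qnew
    policy = Q.argmin(axis=1)                      # mu*(i) = arg min_u Q*(i, u)
    return Q, policy
```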
A MORE GENERAL/ABSTRACT VIEW
Let Y be a real vector space with a norm ||·||
A function F: Y → Y is said to be a contraction mapping if for some ρ ∈ (0, 1), we have
||Fy - Fz|| ≤ ρ ||y - z||,   for all y, z ∈ Y.
ρ is called the modulus of contraction of F.
Important example: Let X be a set (e.g., the state space in DP), v: X → ℝ be a positive-valued function. Let B(X) be the set of all functions J: X → ℝ such that J(x)/v(x) is bounded over x. We define a norm on B(X), called the weighted sup-norm, by
||J|| = max_{x ∈ X} |J(x)| / v(x).
Important special case: The discounted problem mappings T and T_μ [for v(x) ≡ 1, ρ = α].
A DP-LIKE CONTRACTION MAPPING
Let X = {1, 2, . . .}, and let F: B(X) → B(X) be a linear mapping of the form
(FJ)(i) = b_i + Σ_{j ∈ X} a_ij J(j),   i = 1, 2, . . .
where b_i and a_ij are some scalars. Then F is a contraction with modulus ρ if and only if
Σ_{j ∈ X} |a_ij| v(j) / v(i) ≤ ρ,   i = 1, 2, . . .
Let F: B(X) → B(X) be a mapping of the form
(FJ)(i) = min_{μ ∈ M} (F_μ J)(i),   i = 1, 2, . . .
where M is a parameter set, and for each μ ∈ M, F_μ is a contraction mapping from B(X) to B(X) with modulus ρ. Then F is a contraction mapping with modulus ρ.
Allows the extension of the main DP results from bounded cost to unbounded cost.
CONTRACTION MAPPING FIXED-POINT THEOREM
Contraction Mapping Fixed-Point Theorem: If F: B(X) → B(X) is a contraction with modulus ρ ∈ (0, 1), then there exists a unique J* ∈ B(X) such that
J* = FJ*.
Furthermore, if J is any function in B(X), then {F^k J} converges to J* and we have
||F^k J - J*|| ≤ ρ^k ||J - J*||,   k = 1, 2, . . . .
This is a special case of a general result for contraction mappings F: Y → Y over normed vector spaces Y that are complete: every sequence {y_k} that is Cauchy (satisfies ||y_m - y_n|| → 0 as m, n → ∞) converges.
The space B(X) is complete (see the text for a proof).
GENERAL FORMS OF DISCOUNTED DP
We consider an abstract form of DP based on monotonicity and contraction
Abstract Mapping: Denote R(X): set of real-valued functions J: X → ℝ, and let H: X × U × R(X) → ℝ be a given mapping. We consider the mapping
(TJ)(x) = min_{u ∈ U(x)} H(x, u, J),   x ∈ X.
We assume that (TJ)(x) > -∞ for all x ∈ X, so T maps R(X) into R(X).
Abstract Policies: Let M be the set of "policies", i.e., functions μ such that μ(x) ∈ U(x) for all x ∈ X. For each μ ∈ M, we consider the mapping T_μ: R(X) → R(X) defined by
(T_μ J)(x) = H(x, μ(x), J),   x ∈ X.
Find a function J* ∈ R(X) such that
J*(x) = min_{u ∈ U(x)} H(x, u, J*),   x ∈ X
EXAMPLES
Discounted problems (and stochastic shortest paths - SSP for α = 1)
H(x, u, J) = E { g(x, u, w) + α J(f(x, u, w)) }
Discounted Semi-Markov Problems
H(x, u, J) = G(x, u) + Σ_{y=1}^{n} m_xy(u) J(y)
where m_xy are "discounted" transition probabilities, defined by the transition distributions
Shortest Path Problems
H(x, u, J) = a_xu + J(u) if u ≠ d,   H(x, u, J) = a_xd if u = d
where d is the destination. There is also a stochastic version of this problem.
Minimax Problems
H(x, u, J) = max_{w ∈ W(x,u)} [ g(x, u, w) + α J(f(x, u, w)) ]
ASSUMPTIONS
Monotonicity assumption: If J, J' ∈ R(X) and J ≤ J', then
H(x, u, J) ≤ H(x, u, J'),   ∀ x ∈ X, u ∈ U(x)
Contraction assumption:
For every J ∈ B(X), the functions T_μ J and TJ belong to B(X).
For some α ∈ (0, 1), and all μ and J, J' ∈ B(X), we have
||T_μ J - T_μ J'|| ≤ α ||J - J'||
We can show all the standard analytical and computational results of discounted DP based on these two assumptions
With just the monotonicity assumption (as in the SSP or other undiscounted problems) we can still show various forms of the basic results under appropriate assumptions
RESULTS USING CONTRACTION
Proposition 1: The mappings T_μ and T are weighted sup-norm contraction mappings with modulus α over B(X), and have unique fixed points in B(X), denoted J_μ and J*, respectively (cf. Bellman's equation).
Proof: From the contraction property of H.
Proposition 2: For any J ∈ B(X) and μ ∈ M,
lim_{k→∞} T_μ^k J = J_μ,   lim_{k→∞} T^k J = J*
(cf. convergence of value iteration).
Proof: From the contraction property of T_μ and T.
Proposition 3: We have T_μ J* = TJ* if and only if J_μ = J* (cf. optimality condition).
Proof: If T_μ J* = TJ*, then T_μ J* = J*, implying J* = J_μ. Conversely, if J_μ = J*, then T_μ J* = T_μ J_μ = J_μ = J* = TJ*.
RESULTS USING MON. AND CONTRACTION
Optimality of fixed point:
J*(x) = min_{μ ∈ M} J_μ(x),   x ∈ X
Furthermore, for every ε > 0, there exists μ_ε ∈ M such that
J*(x) ≤ J_{μ_ε}(x) ≤ J*(x) + ε,   x ∈ X
Nonstationary policies: Consider the set of all sequences π = {μ_0, μ_1, . . .} with μ_k ∈ M for all k, and define
J_π(x) = liminf_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J)(x),   x ∈ X,
with J being any function (the choice of J does not matter)
We have
J*(x) = min_π J_π(x),   x ∈ X
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Find μ^{k+1} such that
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Optimistic PI: This is PI, where policy evaluation is carried out by a finite number of VI
Shorthand definition: For some integers m_k
T_{μ^k} J_k = T J_k,   J_{k+1} = T_{μ^k}^{m_k} J_k,   k = 0, 1, . . .
If m_k ≡ 1 it becomes VI
If m_k = ∞ it becomes PI
For intermediate values of m_k, it is generally more efficient than either VI or PI
ASYNCHRONOUS ALGORITHMS
Motivation for asynchronous algorithms
Faster convergence
Parallel and distributed computation
Simulation-based implementations
General framework: Partition X into disjoint nonempty subsets X_1, . . . , X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ
Let J be partitioned as
J = (J_1, . . . , J_m),
where J_ℓ is the restriction of J on the set X_ℓ.
Synchronous algorithm:
J_ℓ^{t+1}(x) = T(J_1^t, . . . , J_m^t)(x),   x ∈ X_ℓ, ℓ = 1, . . . , m
Asynchronous algorithm: For some subsets of times R_ℓ,
J_ℓ^{t+1}(x) = T(J_1^{τ_{ℓ1}(t)}, . . . , J_m^{τ_{ℓm}(t)})(x)   if t ∈ R_ℓ,
J_ℓ^{t+1}(x) = J_ℓ^t(x)   if t ∉ R_ℓ
where t - τ_{ℓj}(t) are communication "delays"
ONE-STATE-AT-A-TIME ITERATIONS
Important special case: Assume n "states", a separate processor for each state, and no delays
Generate a sequence of states {x^0, x^1, . . .}, generated in some way, possibly by simulation (each state is generated infinitely often)
Asynchronous VI:
J_ℓ^{t+1} = T(J_1^t, . . . , J_n^t)(ℓ)   if ℓ = x^t,
J_ℓ^{t+1} = J_ℓ^t   if ℓ ≠ x^t,
where T(J_1^t, . . . , J_n^t)(ℓ) denotes the ℓ-th component of the vector
T(J_1^t, . . . , J_n^t) = TJ^t,
and for simplicity we write J_ℓ^t instead of J_ℓ^t(ℓ)
The special case where
{x^0, x^1, . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}
is the Gauss-Seidel method
We can show that J^t → J* under the contraction assumption
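A sketch of the Gauss-Seidel special case described above (states swept cyclically, each update immediately using the latest values of the other states); the model arrays P and G are again hypothetical.

```python
import numpy as np

def gauss_seidel_vi(P, G, alpha, sweeps=200):
    """One-state-at-a-time (Gauss-Seidel) value iteration for a finite MDP."""
    n, m = P[0].shape[0], len(P)
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):   # states visited in the cyclic order 1, ..., n, 1, ...
            # in-place update: later states in this sweep see the new J[i]
            J[i] = min((P[u][i] * (G[u][i] + alpha * J)).sum() for u in range(m))
    return J
```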
ASYNCHRONOUS CONV. THEOREM I
Assume that for all ℓ, j = 1, . . . , m, R_ℓ is infinite and lim_{t→∞} τ_{ℓj}(t) = ∞
Proposition: Let T have a unique fixed point J*, and assume that there is a sequence of nonempty subsets {S(k)} ⊂ R(X) with S(k+1) ⊂ S(k) for all k, and with the following properties:
(1) Synchronous Convergence Condition: Every sequence {J^k} with J^k ∈ S(k) for each k converges pointwise to J*. Moreover, we have
TJ ∈ S(k+1),   ∀ J ∈ S(k), k = 0, 1, . . . .
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S_1(k) × · · · × S_m(k),
where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, . . . , m.
Then for every J ∈ S(0), the sequence {J^t} generated by the asynchronous algorithm converges pointwise to J*.
ASYNCHRONOUS CONV. THEOREM II
Interpretation of assumptions:
[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k+1) containing J*, with J = (J_1, J_2) and the box S_1(0) × S_2(0); the synchronous iterate TJ lands in S(k+1)]
A synchronous iteration from any J in S(k) moves into S(k+1) (component-by-component)
Convergence mechanism:
[Figure: J_1 iterations and J_2 iterations moving within the nested boxes toward J*]
Key: Independent component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k+1)
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 3
LECTURE OUTLINE
Review of theory and algorithms for discounted DP
MDP and stochastic shortest path problems (briefly)
Introduction to approximation in policy and value space
Approximation architectures
Simulation-based approximate policy iteration
Approximate policy iteration and Q-factors
Direct and indirect approximation
Simulation issues
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
MDP - TRANSITION PROBABILITY NOTATION
Assume the system is an n-state (controlled) Markov chain
Change to Markov chain notation
States i = 1, . . . , n (instead of x)
Transition probabilities p_{i_k i_{k+1}}(u_k) [instead of x_{k+1} = f(x_k, u_k, w_k)]
Stage cost g(i_k, u_k, i_{k+1}) [instead of g(x_k, u_k, w_k)]
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(i) = lim_{N→∞} E_{i_k, k=1,2,...} { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
SHORTHAND THEORY A SUMMARY
Cost function expressions [with J_0(i) ≡ 0]
J_π(i) = lim_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J_0)(i),   J_μ(i) = lim_{k→∞} (T_μ^k J_0)(i)
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ or
J*(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
J_μ(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J_μ(j) ),   ∀ i
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ ℝ^n
J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, . . . , n
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(i) = Σ_{j=1}^{n} p_ij(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),   i = 1, . . . , n
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_{μ^k}(j) ),   ∀ i
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Policy evaluation is equivalent to solving an n × n linear system of equations
For large n, exact PI is out of the question (even though it terminates finitely)
STOCHASTIC SHORTEST PATH (SSP) PROBLEMS
Involves states i = 1, . . . , n plus a special cost-free and absorbing termination state t
Objective: Minimize the total (undiscounted) cost. Aim: Reach t at minimum expected cost
An example: Tetris
[Figure: Tetris board and falling piece]
SSP THEORY
SSP problems provide a "soft boundary" between the easy finite-state discounted problems and the hard undiscounted problems.
They share features of both.
Some of the nice theory is recovered because of the termination state.
Definition: A proper policy is a stationary policy that leads to t with probability 1
If all stationary policies are proper, T and T_μ are contractions with respect to a common weighted sup-norm
The entire analytical and algorithmic theory for discounted problems goes through if all stationary policies are proper (we will assume this)
There is a strong theory even if there are improper policies (but they should be assumed to be nonoptimal - see the textbook)
GENERAL ORIENTATION TO ADP
We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n).
Extensions to SSP and average cost are possible (but more quirky). We will set them aside for later.
There are many approaches:
Manual/trial-and-error approach
Problem approximation
Simulation-based approaches (we will focus on these): "neuro-dynamic programming" or "reinforcement learning".
Simulation is essential for large state spaces because of its (potential) computational complexity advantage in computing sums/expectations involving a very large number of terms.
Simulation also comes in handy when an analytical model of the system is unavailable, but a simulation/computer model is possible.
Simulation-based methods are of three types:
Rollout (we will not discuss further)
Approximation in value space
Approximation in policy space
APPROXIMATION IN VALUE SPACE
Approximate J* or J_μ from a parametric class J̃(i, r) where i is the current state and r = (r_1, . . . , r_m) is a vector of "tunable" scalars ("weights").
By adjusting r we can change the "shape" of J̃ so that it is reasonably close to the true optimal J*.
Two key issues:
The choice of parametric class J̃(i, r) (the approximation architecture).
Method for tuning the weights ("training" the architecture).
Successful application strongly depends on how these issues are handled, and on insight about the problem.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model).
We will focus on simulation, but this is not the only possibility [e.g., J̃(i, r) may be a lower bound approximation based on relaxation, or other problem approximation]
APPROXIMATION ARCHITECTURES
Divided in linear and nonlinear [i.e., linear or nonlinear dependence of J̃(i, r) on r].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
Computer chess example: Uses a feature-based position evaluator that assigns a score to each move/position.
[Figure: position evaluator: feature extraction (material balance, mobility, safety, etc.) followed by a weighting of features, producing a score]
Many context-dependent special features.
Most often the weighting of features is linear but multistep lookahead is involved.
In chess, most often the training is done by trial and error.
LINEAR APPROXIMATION ARCHITECTURE
Ideally, the features encode much of the nonlinearity inherent in the cost-to-go approximated
Then the approximation may be quite accurate without a complicated architecture.
With well-chosen features, we can use a linear architecture: J̃(i, r) = φ(i)'r, i = 1, . . . , n, or more compactly J̃(r) = Φr
Φ: the matrix whose rows are φ(i)', i = 1, . . . , n
[Figure: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)'r]
This is approximation on the subspace
S = {Φr | r ∈ ℝ^s}
spanned by the columns of Φ (basis functions)
Many examples of feature types: Polynomial approximation, radial basis functions, kernels of all sorts, interpolation, and special problem-specific (as in chess and tetris)
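A small sketch of a linear architecture J̃(i, r) = φ(i)'r with hand-picked polynomial features; the feature choice, the number of states, and the weight vector are purely illustrative.

```python
import numpy as np

def features(i, n):
    """phi(i): a constant, a linear, and a quadratic term in the scaled state index."""
    x = i / n
    return np.array([1.0, x, x ** 2])

n = 1000                                                   # hypothetical (large) number of states
Phi = np.array([features(i, n) for i in range(1, n + 1)])  # n x s feature matrix
r = np.array([0.5, -1.0, 2.0])                             # tunable weight vector
J_tilde = Phi @ r                                          # approximation on the subspace S = {Phi r}
```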
APPROXIMATION IN POLICY SPACE
A brief discussion; we will return to it at the end.
We parameterize the set of policies by a vector r = (r_1, . . . , r_s) and we optimize the cost over r
Discounted problem example:
Each value of r defines a stationary policy, with cost starting at state i denoted by J̃(i; r).
Use a random search, gradient, or other method to minimize over r
J̄(r) = Σ_{i=1}^{n} p_i J̃(i; r),
where (p_1, . . . , p_n) is some probability distribution over the states.
In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.
A cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman's equation.
APPROX. IN VALUE SPACE - APPROACHES
Approximate PI (Policy evaluation/Policy improvement)
Uses simulation algorithms to approximate the cost J_μ of the current policy μ
Projected equation and aggregation approaches
Approximation of the optimal cost function J*
Q-Learning: Use a simulation algorithm to approximate the optimal costs J*(i) or the Q-factors
Q*(i, u) = g(i, u) + α Σ_{j=1}^{n} p_ij(u) J*(j)
Bellman error approach: Find r to
min_r E_i { ( J̃(i, r) - (T J̃)(i, r) )^2 }
where E_i{·} is taken with respect to some distribution
Approximate LP (we will not discuss here)
APPROXIMATE POLICY ITERATION
General structure
[Figure: system simulator, decision generator producing decisions μ(i), cost-to-go approximator supplying values J̃(j, r), and cost approximation algorithm fed by samples]
J̃(j, r) is the cost approximation for the preceding policy, used by the decision generator to compute the current policy μ [whose cost is approximated by J̃(j, r) using simulation]
There are several cost approximation/policy evaluation algorithms
There are several important issues relating to the design of each block (to be discussed in the future).
POLICY EVALUATION APPROACHES I
Direct policy evaluation
Approximate the cost of the current policy by using least squares and simulation-generated cost samples
Amounts to projection of J_μ onto the approximation subspace
[Figure: direct method: projection ΠJ_μ of the cost vector J_μ onto the subspace S = {Φr | r ∈ ℝ^s}]
Solution of the least squares problem by batch and incremental methods
Regular and optimistic policy iteration
Nonlinear approximation architectures may also be used
POLICY EVALUATION APPROACHES II
Indirect policy evaluation
[Figure: Projected Value Iteration (PVI): the value iterate T_μ(Φr^k) = g + αPΦr^k is projected on S, the subspace spanned by the basis functions, to give Φr^{k+1}; Least Squares Policy Evaluation (LSPE): the same projection plus simulation error]
An example of indirect approach: Galerkin approximation
Solve the projected equation Φr = ΠT_μ(Φr) where Π is projection w/ respect to a suitable weighted Euclidean norm
TD(λ): Stochastic iterative algorithm for solving Φr = ΠT_μ(Φr)
LSPE(λ): A simulation-based form of projected value iteration
Φr^{k+1} = ΠT_μ(Φr^k) + simulation noise
LSTD(λ): Solves a simulation-based approximation w/ a standard solver (Matlab)
POLICY EVALUATION APPROACHES III
Aggregation approximation: Solve
Φr = ΦDT_μ(Φr)
where the rows of D and Φ are prob. distributions (e.g., D and Φ "aggregate" rows and columns of the linear system J = T_μ J).
[Figure: original system states i, j with transition probabilities p_ij(u); aggregate states x, y with disaggregation probabilities d_xi and aggregation probabilities φ_jy; aggregate transition probabilities p̂_xy(u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) φ_jy and aggregate costs ĝ(x, u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) g(i, u, j)]
Several different choices of D and Φ.
POLICY EVALUATION APPROACHES IV
[Figure: aggregation/disaggregation diagram as on the previous slide, with disaggregation probabilities d_xi and aggregation probabilities φ_jy]
Aggregation is a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Because an exact PI algorithm is used to solve the approximate/aggregate problem, the method behaves more regularly than the projected equation approach
THEORETICAL BASIS OF APPROXIMATE PI
If policies are approximately evaluated using an approximation architecture such that
max_i |J̃(i, r_k) - J_{μ^k}(i)| ≤ δ,   k = 0, 1, . . .
If policy improvement is also approximate,
max_i |(T_{μ^{k+1}} J̃)(i, r_k) - (T J̃)(i, r_k)| ≤ ε,   k = 0, 1, . . .
Error bound: The sequence {μ^k} generated by approximate policy iteration satisfies
limsup_{k→∞} max_i ( J_{μ^k}(i) - J*(i) ) ≤ (ε + 2αδ) / (1 - α)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
THE USE OF SIMULATION - AN EXAMPLE
Projection by Monte Carlo Simulation: Compute the projection ΠJ of a vector J ∈ ℝ^n on subspace S = {Φr | r ∈ ℝ^s}, with respect to a weighted Euclidean norm ||·||_ξ.
Equivalently, find Φr*, where
r* = arg min_{r ∈ ℝ^s} ||Φr - J||_ξ^2 = arg min_{r ∈ ℝ^s} Σ_{i=1}^{n} ξ_i ( φ(i)'r - J(i) )^2
Setting to 0 the gradient at r*,
r* = ( Σ_{i=1}^{n} ξ_i φ(i) φ(i)' )^{-1} Σ_{i=1}^{n} ξ_i φ(i) J(i)
Approximate by simulation the two "expected values":
r̂_k = ( Σ_{t=1}^{k} φ(i_t) φ(i_t)' )^{-1} Σ_{t=1}^{k} φ(i_t) J(i_t)
Equivalent least squares alternative:
r̂_k = arg min_{r ∈ ℝ^s} Σ_{t=1}^{k} ( φ(i_t)'r - J(i_t) )^2
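The simulation-based projection above can be sketched as follows; sample_state, cost_sample, and features are hypothetical stand-ins for the state-sampling mechanism (with weights ξ_i), the cost samples J(i_t), and the rows of Φ.

```python
import numpy as np

def mc_projection(sample_state, cost_sample, features, s, k):
    """Monte Carlo projection: solve min_r sum_t ( phi(i_t)' r - J(i_t) )^2."""
    A = np.zeros((s, s))
    b = np.zeros(s)
    for _ in range(k):
        i = sample_state()               # i_t drawn according to the weights xi_i
        phi = features(i)
        A += np.outer(phi, phi)          # accumulates sum_t phi(i_t) phi(i_t)'
        b += phi * cost_sample(i)        # accumulates sum_t phi(i_t) J(i_t)
    return np.linalg.solve(A, b)         # r_hat_k
```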
THE ISSUE OF EXPLORATION
To evaluate a policy μ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under μ.
As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.
This seriously impacts the improved policy.
This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
One possibility for adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.
Another possibility: Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy μ.
Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).
APPROXIMATING Q-FACTORS
The approach described so far for policy evaluation requires calculating expected values [and knowledge of p_ij(u)] for all controls u ∈ U(i).
Model-free alternative: Approximate Q-factors
Q̃(i, u, r) ≈ Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_μ(j) )
and use for policy improvement the minimization
μ̄(i) = arg min_{u ∈ U(i)} Q̃(i, u, r)
r is an adjustable parameter vector and Q̃(i, u, r) is a parametric architecture, such as
Q̃(i, u, r) = Σ_{m=1}^{s} r_m φ_m(i, u)
We can use any approach for cost approximation, e.g., projected equations, aggregation.
Use the Markov chain with states (i, u): p_ij(μ(i)) is the transition prob. to (j, μ(i)), 0 to other (j, u).
Major concern: Acutely diminished exploration.
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
Review of approximation in value space
Approximate VI and PI
Projected Bellman equations
Matrix form of the projected equation
Simulation-based implementation
LSTD and LSPE methods
Optimistic versions
Multistep projected Bellman equations
Bias-variance tradeoff
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, . . . , n and finite set of controls u ∈ U(i)
Transition probabilities: p_ij(u)
[Figure: two-state Markov chain with transition probabilities p_11(u), p_12(u), p_21(u), p_22(u)]
Cost of a policy π = {μ_0, μ_1, . . .} starting at state i:
J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
with α ∈ [0, 1)
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
SHORTHAND THEORY A SUMMARY
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ or
J*(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
J_μ(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J_μ(j) ),   ∀ i
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ ℝ^n
J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, . . . , n
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(i) = Σ_{j=1}^{n} p_ij(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),   i = 1, . . . , n
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_{μ^k}(j) ),   ∀ i
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Policy evaluation is equivalent to solving an n × n linear system of equations
For large n, exact PI is out of the question (even though it terminates finitely)
APPROXIMATION IN VALUE SPACE
Approximate J* or J_μ from a parametric class J̃(i, r), where i is the current state and r = (r_1, . . . , r_m) is a vector of "tunable" scalars ("weights").
By adjusting r we can change the "shape" of J̃ so that it is close to the true optimal J*.
Any r ∈ ℝ^s defines a (suboptimal) one-step lookahead policy
μ̃(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) ),   ∀ i
We will focus mostly on linear architectures
J̃(r) = Φr
where Φ is an n × s matrix whose columns are viewed as basis functions
Think n: HUGE, s: (relatively) SMALL
For J̃(r) = Φr, approximation in value space means approximation of J* or J_μ within the subspace
S = {Φr | r ∈ ℝ^s}
APPROXIMATE VI
Approximates sequentially J_k(i) = (T^k J_0)(i), k = 1, 2, . . ., with J̃_k(i, r_k)
The starting function J_0 is given (e.g., J_0 ≡ 0)
After a large enough number N of steps, J̃_N(i, r_N) is used as approximation J̃(i, r) to J*(i)
Fitted Value Iteration: A sequential "fit" to produce J̃_{k+1} from J̃_k, i.e., J̃_{k+1} ≈ T J̃_k or (for a single policy μ) J̃_{k+1} ≈ T_μ J̃_k
For a small subset S_k of states i, compute
(T J̃_k)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃_k(j, r) )
Fit the function J̃_{k+1}(i, r_{k+1}) to the "small" set of values (T J̃_k)(i), i ∈ S_k
Simulation can be used for "model-free" implementation
Error Bound: If the fit is uniformly accurate within δ > 0 (i.e., max_i |J̃_{k+1}(i) - T J̃_k(i)| ≤ δ),
limsup_{k→∞} max_{i=1,...,n} ( J̃_k(i, r_k) - J*(i) ) ≤ 2αδ / (1 - α)^2
AN EXAMPLE OF FAILURE
Consider a two-state discounted MDP with states 1 and 2, and a single policy.
Deterministic transitions: 1 → 2 and 2 → 2
Transition costs ≡ 0, so J*(1) = J*(2) = 0.
Consider an approximate VI scheme that approximates cost functions in S = { (r, 2r) | r ∈ ℝ } with a weighted least squares fit; here Φ = (1, 2)'
Given J_k = (r_k, 2r_k), we find J_{k+1} = (r_{k+1}, 2r_{k+1}), where for weights ξ_1, ξ_2 > 0, r_{k+1} is obtained as
r_{k+1} = arg min_r [ ξ_1 ( r - (T J_k)(1) )^2 + ξ_2 ( 2r - (T J_k)(2) )^2 ]
With straightforward calculation
r_{k+1} = αβ r_k,   where β = 2(ξ_1 + 2ξ_2)/(ξ_1 + 4ξ_2) > 1
So if α > 1/β, the sequence {r_k} diverges and so does {J_k}.
Difficulty is that T is a contraction, but ΠT (= least squares fit composed with T) is not
Norm mismatch problem
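A quick numerical check of the divergence mechanism above, assuming equal weights ξ_1 = ξ_2 = 1 and α = 0.95 (so that αβ > 1); the numbers are illustrative only.

```python
# Two-state example: costs 0, both states move to state 2, J_k = (r_k, 2 r_k).
alpha, xi1, xi2 = 0.95, 1.0, 1.0
beta = 2 * (xi1 + 2 * xi2) / (xi1 + 4 * xi2)   # = 1.2 here, so alpha * beta = 1.14 > 1
r = 1.0
for k in range(10):
    # (T J_k)(1) = (T J_k)(2) = alpha * 2 * r; the weighted least squares fit gives:
    r = alpha * beta * r
    print(k, r)   # grows geometrically: the fitted iterates diverge
```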
APPROXIMATE PI
[Figure: approximate PI loop: guess initial policy → evaluate approximate cost J̃_μ(r) = Φr using simulation → generate "improved" policy μ̄ → repeat (approximate policy evaluation / policy improvement)]
Evaluation of typical policy μ: Linear cost function approximation J̃_μ(r) = Φr, where Φ is a full rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
Policy "improvement" to generate μ̄:
μ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α φ(j)'r )
Error Bound: If
max_i |J̃_{μ^k}(i, r_k) - J_{μ^k}(i)| ≤ δ,   k = 0, 1, . . .
The sequence {μ^k} satisfies
limsup_{k→∞} max_i ( J_{μ^k}(i) - J*(i) ) ≤ 2αδ / (1 - α)^2
POLICY EVALUATION
Let's consider approximate evaluation of the cost of the current policy by using simulation.
Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares
Indirect policy evaluation - solving the projected equation Φr = ΠT_μ(Φr) where Π is projection w/ respect to a suitable weighted Euclidean norm
[Figure: Direct method: projection ΠJ_μ of the cost vector J_μ on the subspace S spanned by the basis functions. Indirect method: solving a projected form of Bellman's equation, Φr = ΠT_μ(Φr)]
Recall that projection can be implemented by simulation and least squares
WEIGHTED EUCLIDEAN PROJECTIONS
Consider a weighted Euclidean norm
||J||_ξ = ( Σ_{i=1}^{n} ξ_i ( J(i) )^2 )^{1/2},
where ξ is a vector of positive weights ξ_1, . . . , ξ_n.
Let Π denote the projection operation onto
S = {Φr | r ∈ ℝ^s}
with respect to this norm, i.e., for any J ∈ ℝ^n,
ΠJ = Φr*,   where r* = arg min_{r ∈ ℝ^s} ||J - Φr||_ξ^2
PI WITH INDIRECT POLICY EVALUATION
[Figure: approximate PI loop, as before: guess initial policy, evaluate approximate cost J̃_μ(r) = Φr using simulation, generate "improved" policy]
Given the current policy μ:
We solve the projected Bellman's equation
Φr = ΠT_μ(Φr)
We approximate the solution J_μ of Bellman's equation
J_μ = T_μ J_μ
with the projected equation solution J̃_μ(r)
KEY QUESTIONS AND RESULTS
Does the projected equation have a solution?
Under what conditions is the mapping ΠT_μ a contraction, so ΠT_μ has a unique fixed point?
Assuming ΠT_μ has a unique fixed point Φr*, how close is Φr* to J_μ?
Assumption: The Markov chain corresponding to μ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive
ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^{N} P(i_k = j | i_0 = i) > 0
Proposition: (Norm Matching Property)
(a) ΠT_μ is a contraction of modulus α with respect to the weighted Euclidean norm ||·||_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector.
(b) The unique fixed point Φr* of ΠT_μ satisfies
||J_μ - Φr*||_ξ ≤ (1 / √(1 - α^2)) ||J_μ - ΠJ_μ||_ξ
PRELIMINARIES: PROJECTION PROPERTIES
Important property of the projection Π on S with weighted Euclidean norm ||·||_ξ. For all J ∈ ℝ^n, J̄ ∈ S, the Pythagorean Theorem holds:
||J - J̄||_ξ^2 = ||J - ΠJ||_ξ^2 + ||ΠJ - J̄||_ξ^2
Proof: Geometrically, (J - ΠJ) and (ΠJ - J̄) are orthogonal in the scaled geometry of the norm ||·||_ξ, where two vectors x, y ∈ ℝ^n are orthogonal if Σ_{i=1}^{n} ξ_i x_i y_i = 0. Expand the quadratic in the RHS below:
||J - J̄||_ξ^2 = ||(J - ΠJ) + (ΠJ - J̄)||_ξ^2
The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
||ΠJ - ΠJ̄||_ξ ≤ ||J - J̄||_ξ,   for all J, J̄ ∈ ℝ^n.
To see this, note that
||Π(J - J̄)||_ξ^2 ≤ ||Π(J - J̄)||_ξ^2 + ||(I - Π)(J - J̄)||_ξ^2 = ||J - J̄||_ξ^2
PROOF OF CONTRACTION PROPERTY
Lemma: If P is the transition matrix of μ,
||Pz||_ξ ≤ ||z||_ξ,   z ∈ ℝ^n
Proof: Let p_ij be the components of P. For all z ∈ ℝ^n, we have
||Pz||_ξ^2 = Σ_{i=1}^{n} ξ_i ( Σ_{j=1}^{n} p_ij z_j )^2 ≤ Σ_{i=1}^{n} ξ_i Σ_{j=1}^{n} p_ij z_j^2
          = Σ_{j=1}^{n} Σ_{i=1}^{n} ξ_i p_ij z_j^2 = Σ_{j=1}^{n} ξ_j z_j^2 = ||z||_ξ^2,
where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^{n} ξ_i p_ij = ξ_j of the steady-state probabilities.
Using the lemma, the nonexpansiveness of Π, and the definition T_μ J = g + αPJ, we have
||ΠT_μ J - ΠT_μ J̄||_ξ ≤ ||T_μ J - T_μ J̄||_ξ = α ||P(J - J̄)||_ξ ≤ α ||J - J̄||_ξ
for all J, J̄ ∈ ℝ^n. Hence ΠT_μ is a contraction of modulus α.
PROOF OF ERROR BOUND
Let Φr* be the fixed point of ΠT. We have
||J_μ - Φr*||_ξ ≤ (1 / √(1 - α^2)) ||J_μ - ΠJ_μ||_ξ.
Proof: We have
||J_μ - Φr*||_ξ^2 = ||J_μ - ΠJ_μ||_ξ^2 + ||ΠJ_μ - Φr*||_ξ^2
                 = ||J_μ - ΠJ_μ||_ξ^2 + ||ΠT J_μ - ΠT(Φr*)||_ξ^2
                 ≤ ||J_μ - ΠJ_μ||_ξ^2 + α^2 ||J_μ - Φr*||_ξ^2,
where
The first equality uses the Pythagorean Theorem
The second equality holds because J_μ is the fixed point of T and Φr* is the fixed point of ΠT
The inequality uses the contraction property of ΠT.
Q.E.D.
MATRIX FORM OF PROJECTED EQUATION
Its solution is the vector J̃ = Φr*, where r* solves the problem
min_{r ∈ ℝ^s} ||Φr - (g + αPΦr*)||_ξ^2.
Setting to 0 the gradient with respect to r of this quadratic, we obtain
Φ'Ξ ( Φr* - (g + αPΦr*) ) = 0,
where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, . . . , ξ_n along the diagonal.
This is just the orthogonality condition: The error Φr* - (g + αPΦr*) is orthogonal to the subspace spanned by the columns of Φ.
Equivalently,
Cr* = d,
where
C = Φ'Ξ(I - αP)Φ,   d = Φ'Ξg.
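For a small problem where P, g, Φ, and ξ are available explicitly, the matrix form Cr* = d can be solved directly, as in the sketch below; in practice C and d are only estimated by simulation (LSTD), and all inputs here are hypothetical.

```python
import numpy as np

def projected_equation_solution(Phi, P, g, xi, alpha):
    """Exact solution of the projected Bellman equation: C r* = d, r* = C^{-1} d."""
    Xi = np.diag(xi)                                          # diag(xi_1, ..., xi_n)
    C = Phi.T @ Xi @ (np.eye(len(xi)) - alpha * P) @ Phi      # C = Phi' Xi (I - alpha P) Phi
    d = Phi.T @ Xi @ g                                        # d = Phi' Xi g
    return np.linalg.solve(C, d)
```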
PROJECTED EQUATION: SOLUTION METHOD
Matrix inversion: r* = C^{-1} d
Projected Value Iteration (PVI) method:
Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)
Converges to r* because ΠT is a contraction.
[Figure: PVI iteration: the value iterate T(Φr_k) = g + αPΦr_k is projected on S, the subspace spanned by the basis functions, to obtain Φr_{k+1}]
PVI can be written as:
r_{k+1} = arg min_{r ∈ ℝ^s} ||Φr - (g + αPΦr_k)||_ξ^2
By setting to 0 the gradient with respect to r,
Φ'Ξ ( Φr_{k+1} - (g + αPΦr_k) ) = 0,
which yields
r_{k+1} = r_k - (Φ'ΞΦ)^{-1} (Cr_k - d)
SIMULATION-BASED IMPLEMENTATIONS
Key idea: Calculate simulation-based approximations based on k samples
C_k ≈ C,   d_k ≈ d
Matrix inversion r* = C^{-1} d is approximated by
r̂_k = C_k^{-1} d_k
This is the LSTD (Least Squares Temporal Differences) Method.
PVI method r_{k+1} = r_k - (Φ'ΞΦ)^{-1} (Cr_k - d) is approximated by
r_{k+1} = r_k - G_k (C_k r_k - d_k)
where G_k ≈ (Φ'ΞΦ)^{-1}
This is the LSPE (Least Squares Policy Evaluation) Method.
Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).
SIMULATION MECHANICS
We generate an infinitely long trajectory (i_0, i_1, . . .) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.
After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).
We form
C_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) ( φ(i_t) - α φ(i_{t+1}) )' ≈ Φ'Ξ(I - αP)Φ
d_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) g(i_t, i_{t+1}) ≈ Φ'Ξg
Also in the case of LSPE
G_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) φ(i_t)' ≈ Φ'ΞΦ
Convergence based on law of large numbers.
C_k, d_k, and G_k can be formed incrementally. Also can be written using the formalism of temporal differences (this is just a matter of style)
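A sketch of forming C_k and d_k from a single simulated trajectory and returning the LSTD estimate C_k^{-1} d_k; traj (a list of visited states), features (returning φ(i)), and g (the transition cost) are hypothetical inputs.

```python
import numpy as np

def lstd_from_trajectory(traj, features, g, alpha, s):
    """LSTD(0) sketch: build C_k, d_k along one trajectory (i_0, i_1, ...) and solve."""
    C = np.zeros((s, s))
    d = np.zeros(s)
    k = 0
    for (i, j) in zip(traj[:-1], traj[1:]):            # transitions (i_t, i_{t+1})
        phi_i, phi_j = features(i), features(j)
        C += np.outer(phi_i, phi_i - alpha * phi_j)    # accumulates phi(i_t)(phi(i_t)-alpha phi(i_{t+1}))'
        d += phi_i * g(i, j)                           # accumulates phi(i_t) g(i_t, i_{t+1})
        k += 1
    C /= k
    d /= k
    return np.linalg.solve(C, d)                       # r_hat_k = C_k^{-1} d_k
```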
OPTIMISTIC VERSIONS
Instead of calculating nearly exact approximations C_k ≈ C and d_k ≈ d, we do a less accurate approximation, based on few simulation samples
Evaluate (coarsely) current policy μ, then do a policy improvement
This often leads to faster computation (as optimistic methods often do)
Very complex behavior (see the subsequent discussion on oscillations)
The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling)
LSPE tends to cope better because of its iterative nature
A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise
r_{k+1} = r_k - γ G_k (C_k r_k - d_k)
MULTISTEP METHODS
Introduce a multistep version of Bellman's equation J = T^(λ) J, where for λ ∈ [0, 1),
T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}
Geometrically weighted sum of powers of T.
Note that T^ℓ is a contraction with modulus α^ℓ, with respect to the weighted Euclidean norm ||·||_ξ, where ξ is the steady-state probability vector of the Markov chain.
Hence T^(λ) is a contraction with modulus
α_λ = (1 - λ) Σ_{ℓ=0}^{∞} α^{ℓ+1} λ^ℓ = α(1 - λ) / (1 - αλ)
Note that α_λ → 0 as λ → 1
T and T^(λ) have the same fixed point J_μ and
||J_μ - Φr*_λ||_ξ ≤ (1 / √(1 - α_λ^2)) ||J_μ - ΠJ_μ||_ξ
where Φr*_λ is the fixed point of ΠT^(λ).
The fixed point Φr*_λ depends on λ.
BIAS-VARIANCE TRADEOFF
[Figure: bias-variance tradeoff within the subspace S = {Φr | r ∈ ℝ^s}: the solution of the projected equation Φr = ΠT^(λ)(Φr) moves from ΠJ_μ (λ = 1, small bias) toward the λ = 0 solution (larger bias), while the simulation error grows as λ → 1]
Error bound: ||J_μ - Φr*_λ||_ξ ≤ (1 / √(1 - α_λ^2)) ||J_μ - ΠJ_μ||_ξ
As λ → 1, we have α_λ → 0, so the error bound (and the quality of approximation) improves as λ → 1. In fact
lim_{λ→1} Φr*_λ = ΠJ_μ
But the simulation noise in approximating
T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}
increases
Choice of λ is usually based on trial and error
MULTISTEP PROJECTED EQ. METHODS
The projected Bellman equation is
Φr = ΠT^(λ)(Φr)
In matrix form: C^(λ) r = d^(λ), where
C^(λ) = Φ'Ξ ( I - αP^(λ) ) Φ,   d^(λ) = Φ'Ξ g^(λ),
with
P^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^{ℓ+1},   g^(λ) = Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^ℓ g
The LSTD(λ) method is
r̂_k = ( C_k^(λ) )^{-1} d_k^(λ),
where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).
The LSPE(λ) method is
r_{k+1} = r_k - γ G_k ( C_k^(λ) r_k - d_k^(λ) )
where G_k is a simulation-based approx. to (Φ'ΞΦ)^{-1}
TD(λ): An important simpler/slower iteration [similar to LSPE(λ) with G_k = I - see the text].
MORE ON MULTISTEP METHODS
The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (single simulation trajectory i_0, i_1, . . ., more complex formulas)
C_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} ( φ(i_m) - α φ(i_{m+1}) )'
d_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} g(i_m, i_{m+1})
In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).
Many different versions (see the text).
Note the λ-tradeoffs:
As λ → 1, C_k^(λ) and d_k^(λ) contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
The error bound ||J_μ - Φr_λ||_ξ becomes smaller
As λ → 1, ΠT^(λ) becomes a contraction for arbitrary projection norm
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
Review of approximate PI
Review of approximate policy evaluation based on projected Bellman equations
Exploration enhancement in policy evaluation
Oscillations in approximate PI
Aggregation - An alternative to the projected equation/Galerkin approach
Examples of aggregation
Simulation-based aggregation
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, . . . , n and finite set of controls u ∈ U(i)
Transition probabilities: p_ij(u)
[Figure: two-state Markov chain with transition probabilities p_11(u), p_12(u), p_21(u), p_22(u)]
Cost of a policy π = {μ_0, μ_1, . . .} starting at state i:
J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
with α ∈ [0, 1)
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
APPROXIMATE PI
[Figure: approximate PI loop: guess initial policy → evaluate approximate cost J̃_μ(r) = Φr using simulation → generate "improved" policy μ̄]
Evaluation of typical policy μ: Linear cost function approximation
J̃_μ(r) = Φr
where Φ is a full rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
Policy "improvement" to generate μ̄:
μ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α φ(j)'r )
EVALUATION BY PROJECTED EQUATIONS
We discussed approximate policy evaluation by solving the projected equation
Φr = ΠT_μ(Φr)
Π: projection with a weighted Euclidean norm
Implementation by simulation (single long trajectory using current policy - important to make ΠT_μ a contraction). LSTD, LSPE methods.
Multistep option: Solve Φr = ΠT_μ^(λ)(Φr) with
T_μ^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T_μ^{ℓ+1}
As λ → 1, ΠT_μ^(λ) becomes a contraction for any projection norm
Bias-variance tradeoff
[Figure: bias-variance tradeoff, as before: the solution of Φr = ΠT^(λ)(Φr) moves between the λ = 0 solution and ΠJ_μ (λ = 1), with simulation error growing as λ → 1]
POLICY ITERATION ISSUES: EXPLORATION
1st major issue: exploration. To evaluate μ, we need to generate cost samples using μ
This biases the simulation by underrepresenting states that are unlikely to occur under μ.
As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.
This seriously impacts the improved policy.
This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
Common remedy is the off-policy approach: Replace P of current policy with a "mixture"
P̄ = (I - B)P + BQ
where B is a diagonal matrix with diagonal components in [0, 1] and Q is another transition matrix.
LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated. Related methods and ideas: importance sampling, geometric and free-form sampling (see the text).
POLICY ITERATION ISSUES: OSCILLATIONS
2nd major issue: oscillation of policies
Analysis using the greedy partition: R_μ is the set of parameter vectors r for which μ is greedy with respect to J̃(·, r) = Φr
R_μ = { r | T_μ(Φr) = T(Φr) }
There is a finite number of possible vectors r_μ, one generated from another in a deterministic way
[Figure: greedy partition regions R_{μ^k}, R_{μ^{k+1}}, R_{μ^{k+2}}, R_{μ^{k+3}} with the parameter vectors r_{μ^k}, r_{μ^{k+1}}, r_{μ^{k+2}}, r_{μ^{k+3}} cycling among them]
The algorithm ends up repeating some cycle of policies μ^k, μ^{k+1}, . . . , μ^{k+m} with
r_{μ^k} ∈ R_{μ^{k+1}}, r_{μ^{k+1}} ∈ R_{μ^{k+2}}, . . . , r_{μ^{k+m}} ∈ R_{μ^k}
Many different cycles are possible
MORE ON OSCILLATIONS/CHATTERING
In the case of optimistic policy iteration a different picture holds
[Figure: greedy partition regions R_{μ^1}, R_{μ^2}, R_{μ^3} with intermediate parameter vectors r_{μ^1}, r_{μ^2}, r_{μ^3}]
Oscillations are less violent, but the "limit" point is meaningless!
Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J' does not imply ΠJ ≤ ΠJ'.
If approximate PI uses policy evaluation
Φr = (WT_μ)(Φr)
with W a monotone operator, the generated policies converge (to a possibly nonoptimal limit).
The operator W used in the aggregation approach has this monotonicity property.
PROBLEM APPROXIMATION - AGGREGATION
Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem.
The simplification is often ad-hoc/problem-dependent.
Aggregation is a systematic approach for problem approximation. Main elements:
Introduce a few "aggregate" states, viewed as the states of an "aggregate" system
Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.
AGGREGATION/DISAGGREGATION PROBS
(Figure: original system states i, j and aggregate states x, y, connected by the disaggregation probabilities d_xi (matrix D) and the aggregation probabilities φ_jy (matrix Φ).)
The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices.

For each original system state j and aggregate state y, the aggregation probability φ_jy:
- Roughly, the degree of membership of j in the aggregate state y.
- In hard aggregation, φ_jy = 1 if state j belongs to aggregate state/subset y.

For each aggregate state x and original system state i, the disaggregation probability d_xi:
- Roughly, the degree to which i is representative of x.
- In hard aggregation, equal d_xi for all i in x.
AGGREGATE SYSTEM DESCRIPTION
The transition probability from aggregate state x to aggregate state y under control u is

p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy,   or   P̂(u) = D P(u) Φ

where the rows of D and Φ are the disaggregation and aggregation probabilities.

The expected transition cost is

ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j),   or   ĝ = D P g
The optimal cost function of the aggregate problem, denoted R̂, is

R̂(x) = min_{u ∈ U} [ ĝ(x, u) + α Σ_y p̂_xy(u) R̂(y) ],   for all x

Bellman's equation for the aggregate problem.
The optimal cost function J* of the original problem is approximated by J̃ given by

J̃(j) = Σ_y φ_jy R̂(y),   for all j
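To make the construction concrete, here is a small sketch (my own; the array names and shapes are assumptions, not from the text) that forms P̂(u) = D P(u) Φ and ĝ(x, u), solves the aggregate Bellman equation by value iteration, and returns the approximation J̃ = Φ R̂:

    import numpy as np

    def solve_aggregate(P, g, D, Phi, alpha, iters=1000):
        # P[u]: n x n transition matrix, g[u]: n x n stage-cost matrix for control u,
        # D: m x n disaggregation probabilities, Phi: n x m aggregation probabilities.
        num_u, m = len(P), D.shape[0]
        P_hat = [D @ P[u] @ Phi for u in range(num_u)]                 # aggregate transition probs
        g_hat = [D @ (P[u] * g[u]).sum(axis=1) for u in range(num_u)]  # expected aggregate costs
        R = np.zeros(m)
        for _ in range(iters):       # value iteration on the aggregate problem
            R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in range(num_u)], axis=0)
        return Phi @ R, R            # J~(j) = sum_y phi_jy R^(y), and R^ itself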
EXAMPLE I: HARD AGGREGATION
Group the original system states into subsets, and view each subset as an aggregate state.

Aggregation probs.: φ_jy = 1 if j belongs to aggregate state y.

Disaggregation probs.: There are many possibilities, e.g., all states i within aggregate state x have equal prob. d_xi.

If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. Suggests grouping states with roughly equal cost into aggregates.

A variant: soft aggregation (provides soft boundaries between aggregate states).
(Figure: a 3 x 3 grid of states 1-9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, with aggregation matrix

Φ =
  [ 1 0 0 0
    1 0 0 0
    0 1 0 0
    1 0 0 0
    1 0 0 0
    0 1 0 0
    0 0 1 0
    0 0 1 0
    0 0 0 1 ] )
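For concreteness, a short sketch (mine, not from the slides) that builds this hard-aggregation Φ from the partition above (0-based state indices) together with equal disaggregation probabilities:

    import numpy as np

    groups = [[0, 1, 3, 4], [2, 5], [6, 7], [8]]       # x1, x2, x3, x4 for states 1-9 (0-based)
    n, m = 9, len(groups)
    Phi = np.zeros((n, m))
    for y, members in enumerate(groups):
        Phi[members, y] = 1.0                          # phi_jy = 1 iff j belongs to subset y
    D = Phi.T / Phi.sum(axis=0, keepdims=True).T       # equal d_xi within each subset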
EXAMPLE II: FEATURE-BASED AGGREGATION
Important question: How do we group states together?

If we know good features, it makes sense to group together states that have similar features.
(Figure: states, feature extraction, features, aggregate states.)
A general approach for passing from a feature-based state representation to an aggregation-based architecture.

Essentially discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.

An aggregation-based architecture is more powerful (nonlinear in the features) ...

... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.
EXAMPLE III: REP. STATES/COARSE GRID
Choose a collection of representative original system states, and associate each one of them with an aggregate state.
(Figure: original state space with representative/aggregate states y1, y2, y3 and original system states j, j1, j2, j3.)
Disaggregation probabilities are d_xi = 1 if i is equal to representative state x.

Aggregation probabilities associate original system states with convex combinations of representative states:

j ≈ Σ_{y ∈ A} φ_jy y

Well-suited for Euclidean space discretization.

Extends nicely to continuous state space, including the belief space of POMDP.
EXAMPLE IV: REPRESENTATIVE FEATURES
Here the aggregate states are nonempty subsets of original system states (but need not form a partition of the state space).

Example: Choose a collection of distinct representative feature vectors, and associate each of them with an aggregate state consisting of original system states with similar features.

Restrictions:
- The aggregate states/subsets are disjoint.
- The disaggregation probabilities satisfy d_xi > 0 if and only if i ∈ x.
- The aggregation probabilities satisfy φ_jy = 1 for all j ∈ y.
If every original system state i belongs to some aggregate state, we obtain hard aggregation.

If every aggregate state consists of a single original system state, we obtain aggregation with representative states.

With the above restrictions DΦ = I, so (ΦD)(ΦD) = ΦD, and ΦD is an oblique projection (an orthogonal projection in the case of hard aggregation).
APPROXIMATE PI BY AGGREGATION
(Figure: original system states i, j with transition probabilities p_ij(u) and costs g(i, u, j), linked to aggregate states x, y through the disaggregation probabilities d_xi and aggregation probabilities φ_jy; the aggregate quantities are p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy and ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j).)
Consider approximate policy iteration for the original problem, with policy evaluation done by aggregation.

Evaluation of policy μ: J̃_μ = ΦR, where R = DT_μ(ΦR) (R is the vector of costs of aggregate states for μ). Can be done by simulation.

Looks like the projected equation ΦR = ΠT_μ(ΦR) (but with ΦD in place of Π).

Advantages: It has no problem with exploration or with oscillations.

Disadvantage: The rows of D and Φ must be probability distributions.
DISTRIBUTED AGGREGATION I
We consider decomposition/distributed solution of large-scale discounted DP problems by aggregation.

Partition the original system states into subsets S_1, ..., S_m.

Each subset S_ℓ, ℓ = 1, ..., m:
- Maintains detailed/exact local costs J(i) for every original system state i ∈ S_ℓ, using the aggregate costs of the other subsets
- Maintains an aggregate cost R(ℓ) = Σ_{i ∈ S_ℓ} d_{ℓi} J(i)
- Sends R(ℓ) to the other aggregate states

J(i) and R(ℓ) are updated by VI according to
J_{k+1}(i) = min_{u ∈ U(i)} H(i, u, J_k, R_k),   i ∈ S_ℓ

with R_k being the vector of R(ℓ) at time k, and

H(i, u, J, R) = Σ_{j=1}^n p_ij(u) g(i, u, j) + α Σ_{j ∈ S_ℓ} p_ij(u) J(j) + α Σ_{ℓ' ≠ ℓ} Σ_{j ∈ S_ℓ'} p_ij(u) R(ℓ')
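A sketch of the mapping H (my own naming; subsets[ell] is assumed to list the states of S_ell, and which[i] to give the index ell of the subset containing i):

    import numpy as np

    def H(i, u, J, R, P, g, alpha, subsets, which):
        # H(i,u,J,R) = sum_j p_ij(u) g(i,u,j)
        #   + alpha * sum_{j in S_ell}  p_ij(u) J(j)
        #   + alpha * sum_{ell' != ell} sum_{j in S_ell'} p_ij(u) R(ell')
        ell = which[i]
        val = P[u][i] @ g[u][i]                                    # expected stage cost
        val += alpha * P[u][i][subsets[ell]] @ J[subsets[ell]]     # detailed local costs in S_ell
        for other, S in enumerate(subsets):
            if other != ell:
                val += alpha * P[u][i][S].sum() * R[other]         # aggregate costs of other subsets
        return val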
DISTRIBUTED AGGREGATION II
Can show that this iteration involves a sup-norm contraction mapping of modulus α, so it converges to the unique solution of the system of equations in (J, R):

J(i) = min_{u ∈ U(i)} H(i, u, J, R),   R(ℓ) = Σ_{i ∈ S_ℓ} d_{ℓi} J(i),

for all i ∈ S_ℓ, ℓ = 1, ..., m.

This follows from the fact that {d_{ℓi} | i = 1, ..., n} is a probability distribution.

View these equations as a set of Bellman equations for an aggregate DP problem. The difference is that the mapping H involves J(j) rather than R(x(j)) for j ∈ S_ℓ.

In an asynchronous version of the method, the aggregate costs R(ℓ) may be outdated to account for communication delays between aggregate states.
Convergence can be shown using the general theory of asynchronous distributed computation (see the text).
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
Review of Q-factors and Bellman equations for Q-factors
VI and PI for Q-factors
Q-learning - Combination of VI and sampling
Q-learning and cost function approximation
Approximation in policy space
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, ..., n and a finite set of controls u ∈ U(i).

Transition probabilities: p_ij(u)
Cost of a policy π = {μ_0, μ_1, ...} starting at state i:

J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i = i_0 }

with α ∈ [0, 1).

Shorthand notation for the DP mappings:
(T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ),   i = 1, ..., n

(T_μ J)(i) = Σ_{j=1}^n p_ij(μ(i)) ( g(i, μ(i), j) + αJ(j) ),   i = 1, ..., n
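These mappings are easy to code; a possible NumPy sketch (mine, with P[u] and g[u] as n x n arrays and mu an array of control indices) is:

    import numpy as np

    def T(J, P, g, alpha):
        # (T J)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))
        return np.min([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(len(P))], axis=0)

    def T_mu(J, mu, P, g, alpha):
        # (T_mu J)(i) = sum_j p_ij(mu(i)) (g(i,mu(i),j) + alpha J(j))
        return np.array([P[mu[i]][i] @ (g[mu[i]][i] + alpha * J) for i in range(len(J))])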
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ R^n,

J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, ..., n
Policy iteration: Given μ^k:

- Policy evaluation: Find J_{μ^k} by solving

J_{μ^k}(i) = Σ_{j=1}^n p_ij(μ^k(i)) ( g(i, μ^k(i), j) + αJ_{μ^k}(j) ),   i = 1, ..., n

or J_{μ^k} = T_{μ^k} J_{μ^k}
- Policy improvement: Let μ^{k+1} be such that

μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ_{μ^k}(j) ),   for all i

or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
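A compact sketch of exact PI under these definitions (my own; it assumes a common control set and the same P, g arrays as above), with evaluation done by solving the linear system J_μ = T_μ J_μ:

    import numpy as np

    def policy_iteration(P, g, alpha, max_iters=100):
        num_u, n = len(P), P[0].shape[0]
        mu = np.zeros(n, dtype=int)                  # arbitrary initial policy
        for _ in range(max_iters):
            # Policy evaluation: solve (I - alpha P_mu) J = g_mu
            P_mu = np.array([P[mu[i]][i] for i in range(n)])
            g_mu = np.array([P[mu[i]][i] @ g[mu[i]][i] for i in range(n)])
            J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
            # Policy improvement: T_{mu^{k+1}} J = T J
            Q = np.array([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(num_u)])
            new_mu = Q.argmin(axis=0)
            if np.array_equal(new_mu, mu):
                break
            mu = new_mu
        return mu, J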
We discussed approximate versions of VI and PI using projection and aggregation.

We focused so far on cost functions and their approximation. We now consider Q-factors.
BELLMAN EQUATIONS FOR Q-FACTORS
The optimal Q-factors are defined by

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) ),   for all (i, u)

Since J* = T J*, we have J*(i) = min_{u ∈ U(i)} Q*(i, u), so the optimal Q-factors solve the equation

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = F Q*, where

(F Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )
This is Bellman's equation for a system whose states are the pairs (i, u).

Similar mapping F_μ and Bellman equation for a policy μ: Q_μ = F_μ Q_μ
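The mapping F is straightforward to write down; a small sketch (mine, with Q stored as an n x m array and a common control set) is:

    import numpy as np

    def F(Q, P, g, alpha):
        # (F Q)(i,u) = sum_j p_ij(u) (g(i,u,j) + alpha min_{u'} Q(j,u'))
        J = Q.min(axis=1)                    # J(j) = min over u' of Q(j, u')
        cols = [(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(Q.shape[1])]
        return np.stack(cols, axis=1)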
(Figure: state-control pairs (i, u); under control u the system moves to j with probability p_ij(u) and cost g(i, u, j), and then to the pair (j, μ(j)) under a fixed policy μ.)
SUMMARY OF BELLMAN EQS FOR Q-FACTORS
Optimal Q-factors: For all (i, u),

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = F Q*, where

(F Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )
Q-factors of a policy μ: For all (i, u),

Q_μ(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αQ_μ(j, μ(j)) )

Equivalently Q_μ = F_μ Q_μ, where

(F_μ Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αQ(j, μ(j)) )
WHAT IS GOOD AND BAD ABOUT Q-FACTORS
All the exact theory and algorithms for costs apply to Q-factors:
- Bellman's equations, contractions, optimality conditions, convergence of VI and PI

All the approximate theory and algorithms for costs apply to Q-factors:
- Projected equations, sampling and exploration issues, oscillations, aggregation
A MODEL-FREE (on-line) controller implementation:
- Once we calculate Q*(i, u) for all (i, u),

μ*(i) = arg min_{u ∈ U(i)} Q*(i, u),   for all i

- Similarly, once we calculate a parametric approximation Q̃(i, u, r) for all (i, u),

μ̃(i) = arg min_{u ∈ U(i)} Q̃(i, u, r),   for all i

The main bad thing: greater dimension and more storage! [Can be used for large-scale problems only through aggregation, or other cost function approximation.]
Q-LEARNING
In addition to the approximate PI methods adapted for Q-factors, there is an important additional algorithm: Q-learning, which can be viewed as a sampled form of VI.
Q-learning algorithm (in its classical form):
- Sampling: Select a sequence of pairs (i_k, u_k) (use any probabilistic mechanism for this, but all pairs (i, u) are chosen infinitely often).
- Iteration: For each k, select j_k according to p_{i_k j}(u_k). Update just Q(i_k, u_k):

Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )

- Leave unchanged all other Q-factors: Q_{k+1}(i, u) = Q_k(i, u) for all (i, u) ≠ (i_k, u_k).
- Stepsize conditions: γ_k must converge to 0 at a proper rate (e.g., like 1/k).
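A minimal sketch of the classical Q-learning iteration (mine; sample_next(i, u) is a hypothetical simulator returning the next state and stage cost, and the stepsize is 1/(number of visits of the pair)):

    import numpy as np

    def q_learning(sample_next, n, m, alpha, num_iters=100000, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n, m))
        visits = np.zeros((n, m))
        for _ in range(num_iters):
            i, u = rng.integers(n), rng.integers(m)     # sample pairs; each (i,u) chosen infinitely often
            j, cost = sample_next(i, u)                 # simulate j_k according to p_{i_k j}(u_k)
            visits[i, u] += 1
            gamma = 1.0 / visits[i, u]                  # stepsize converging to 0 like 1/k
            Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[j].min())
        return Q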
NOTES AND QUESTIONS ABOUT Q-LEARNING
Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )
Model-free implementation. We just need a simulator that, given (i, u), produces the next state j and cost g(i, u, j).

Operates on only one state-control pair at a time. Convenient for simulation, no restrictions on the sampling method.

Aims to find the (exactly) optimal Q-factors.

Why does it converge to Q*?

Why can't I use a similar algorithm for optimal costs?
Important mathematical (fine) point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the cost version of Bellman's equation:

J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) )
CONVERGENCE ASPECTS OF Q-LEARNING
Q-learning can be shown to converge to the true/exact Q-factors (under mild assumptions).

The proof is sophisticated, based on theories of stochastic approximation and asynchronous algorithms.
Uses the fact that the Q-learning map F:

(F Q)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }

is a sup-norm contraction.
Generic stochastic approximation algorithm:
- Consider a generic fixed point problem involving an expectation:

x = E_w { f(x, w) }

- Assume E_w { f(x, w) } is a contraction with respect to some norm, so the iteration

x_{k+1} = E_w { f(x_k, w) }

converges to the unique fixed point.
- Approximate E_w { f(x, w) } by sampling.
STOCH. APPROX. CONVERGENCE IDEAS
For each k, obtain samples {w_1, ..., w_k} and use the approximation

x_{k+1} = (1/k) Σ_{t=1}^k f(x_k, w_t) ≈ E { f(x_k, w) }

This iteration approximates the convergent fixed point iteration x_{k+1} = E_w { f(x_k, w) }.

A major flaw: it requires, for each k, the computation of f(x_k, w_t) for all values w_t, t = 1, ..., k.

This motivates the more convenient iteration

x_{k+1} = (1/k) Σ_{t=1}^k f(x_t, w_t),   k = 1, 2, ...,

that is similar, but requires much less computation; it needs only one value of f per sample w_t.

By denoting γ_k = 1/k, it can also be written as

x_{k+1} = (1 - γ_k) x_k + γ_k f(x_k, w_k),   k = 1, 2, ...

Compare with Q-learning, where the fixed point problem is Q = F Q:
(F Q)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }
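A tiny numerical illustration of the iteration (mine, with a made-up contraction f(x, w) = 0.5 x + w and E{w} = 1, so the fixed point of x = E_w{ f(x, w) } is x* = 2):

    import numpy as np

    rng = np.random.default_rng(0)
    x = 0.0
    for k in range(1, 100001):
        w = rng.exponential(1.0)                        # noise sample with E{w} = 1
        gamma = 1.0 / k                                 # stepsize gamma_k = 1/k
        x = (1 - gamma) * x + gamma * (0.5 * x + w)     # x_{k+1} = (1 - gamma_k) x_k + gamma_k f(x_k, w_k)
    print(x)                                            # approaches the fixed point x* = 2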
Q-FACTOR APPROXIMATIONS
We introduce basis function approximation:

Q̃(i, u, r) = φ(i, u)' r

We can use approximate policy iteration and LSPE/LSTD for policy evaluation.

Optimistic policy iteration methods are frequently used on a heuristic basis.

Example: Generate a trajectory {(i_k, u_k) | k = 0, 1, ...}.

At iteration k, given r_k and state/control (i_k, u_k):
(1) Simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).

(2) Generate the control u_{k+1} from

u_{k+1} = arg min_{u ∈ U(i_{k+1})} Q̃(i_{k+1}, u, r_k)

(3) Update the parameter vector via

r_{k+1} = r_k - (LSPE or TD-like correction)

Complex behavior, unclear validity (oscillations, etc). There is a solid basis for an important special case: optimal stopping (see the text).
APPROXIMATION IN POLICY SPACE
We parameterize policies by a vector r = (r_1, ..., r_s) (an approximation architecture for policies).

Each policy μ̃(r) = { μ̃(i; r) | i = 1, ..., n } defines a cost vector J_{μ̃(r)} (a function of r).

We optimize some measure of J_{μ̃(r)} over r. For example, use a random search, gradient, or other method to minimize over r

Σ_{i=1}^n p_i J_{μ̃(r)}(i),

where (p_1, ..., p_n) is some probability distribution over the states.
An important special case: Introduce a cost approximation architecture V(i, r) that defines indirectly the parameterization of the policies

μ̃(i; r) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αV(j, r) ),   for all i

Brings in features to approximation in policy space.
APPROXIMATION IN POLICY SPACE METHODS
Random search methods are straightforward and have scored some impressive successes with challenging problems (e.g., tetris).

Gradient-type methods (known as policy gradient methods) also have been worked on extensively.

They move along the gradient with respect to r of

Σ_{i=1}^n p_i J_{μ̃(r)}(i)

There are explicit gradient formulas which have been approximated by simulation.

Policy gradient methods generally suffer from slow convergence, local minima, and excessive simulation noise.
FINAL WORDS AND COMPARISONS
There is no clear winner among ADP methods.

There is interesting theory in all types of methods (which, however, does not provide ironclad performance guarantees).

There are major flaws in all methods:
- Oscillations and exploration issues in approximate PI with projected equations
- Restrictions on the approximation architecture in approximate PI with aggregation
- Flakiness of optimization in policy space approximation

Yet these methods have impressive successes to show with enormously complex problems, for which there is no alternative methodology.

There are also other competing ADP methods (rollout is simple, often successful, and generally reliable; approximate LP is worth considering).

Theoretical understanding is important and nontrivial.

Practice is an art and a challenge to our creativity.