MIT 6.231 Notes Short
TRANSCRIPT
APPROXIMATE DYNAMIC PROGRAMMING
A SERIES OF LECTURES GIVEN AT
CEA - CADARACHE
FRANCE
SUMMER 2012
DIMITRI P. BERTSEKAS
These lecture slides are based on the book:
Dynamic Programming and Optimal Control: Approximate Dynamic Programming, Athena Scientific, 2012; see
http://www.athenasc.com/dpbook.html
For a fuller set of slides, see
http://web.mit.edu/dimitrib/www/publ.html
APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE I
Our subject:
Large-scale DP based on approximations and in part on simulation.
This has been a research area of great interest for the last 20 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming)
Emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory
Deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization)
A vast range of applications in control theory, operations research, artificial intelligence, and beyond ...
The subject is broad with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms ... less on theory and modeling
APPROXIMATE DYNAMIC PROGRAMMING
BRIEF OUTLINE II
Our aim:
A state-of-the-art account of some of the major topics at a graduate level
Show how the use of approximation and simulation can address the dual curses of DP: dimensionality and modeling
Our 7-lecture plan:
Two lectures on exact DP with emphasis on infinite horizon problems and issues of large-scale computational methods
One lecture on general issues of approximation and simulation for large-scale problems
One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
One lecture on aggregation methods
One lecture on stochastic approximation, Q-learning, and other methods
One lecture on Monte Carlo methods for solving general problems involving linear equations and inequalities
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 1
LECTURE OUTLINE
Introduction to DP and approximate DP
Finite horizon problems
The DP algorithm for finite horizon problems
Infinite horizon problems
Basic theory of discounted infinite horizon problems
BASIC STRUCTURE OF STOCHASTIC DP
Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, . . . , N - 1
k: Discrete time
x_k: State; summarizes past information that is relevant for future optimization
u_k: Control; decision to be selected at time k from a given set
w_k: Random parameter (also called disturbance or noise depending on the context)
N: Horizon or number of times control is applied
Cost function that is additive over time
E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
Alternative system description: P(x_{k+1} | x_k, u_k)
x_{k+1} = w_k with P(w_k | x_k, u_k) = P(x_{k+1} | x_k, u_k)
INVENTORY CONTROL EXAMPLE
[Figure: inventory system with stock x_k at period k, order u_k, demand w_k; stock at period k+1 is x_{k+1} = x_k + u_k - w_k; cost of period k is c u_k + r(x_k + u_k - w_k)]
Discrete-time system
x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
Cost function that is additive over time
E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E { Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
ADDITIONAL ASSUMPTIONS
Optimization over policies: These are rules/functions
u_k = μ_k(x_k),   k = 0, . . . , N - 1
that map states to controls (closed-loop optimization, use of feedback)
The set of values that the control u_k can take depends at most on x_k and not on prior x's or u's
The probability distribution of w_k does not depend on past values w_{k-1}, . . . , w_0, but may depend on x_k and u_k
Otherwise past values of w or x would be useful for future optimization
GENERIC FINITE-HORIZON PROBLEM
System x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, . . . , N - 1
Control constraints u_k ∈ U_k(x_k)
Probability distribution P_k(· | x_k, u_k) of w_k
Policies π = {μ_0, . . . , μ_{N-1}}, where μ_k maps states x_k into controls u_k = μ_k(x_k) and is such that μ_k(x_k) ∈ U_k(x_k) for all x_k
Expected cost of π starting at x_0 is
J_π(x_0) = E { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) }
Optimal cost function
J*(x_0) = min_π J_π(x_0)
Optimal policy π* satisfies
J_{π*}(x_0) = J*(x_0)
When produced by DP, π* is independent of x_0.
PRINCIPLE OF OPTIMALITY
Let π* = {μ*_0, μ*_1, . . . , μ*_{N-1}} be an optimal policy
Consider the tail subproblem whereby we are at x_k at time k and wish to minimize the "cost-to-go" from time k to time N
E { g_N(x_N) + Σ_{ℓ=k}^{N-1} g_ℓ(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
and the tail policy {μ*_k, μ*_{k+1}, . . . , μ*_{N-1}}
Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
DP solves ALL the tail subproblems
At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length
DP ALGORITHM
J_k(x_k): opt. cost of tail problem starting at x_k
Start with
J_N(x_N) = g_N(x_N),
and go backwards using
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k} { g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) },   k = 0, 1, . . . , N - 1
i.e., to solve the tail subproblem at time k, minimize
[kth-stage cost + opt. cost of the tail problem starting from the next state at time k+1]
Then J_0(x_0), generated at the last step, is equal to the optimal cost J*(x_0). Also, the policy
π* = {μ*_0, . . . , μ*_{N-1}}
where μ*_k(x_k) minimizes in the right side above for each x_k and k, is optimal
Proof by induction
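As an illustration of the backward recursion above, here is a minimal Python sketch for a small finite model. The problem data (states, controls, disturbance distribution, f, g, g_N) are hypothetical placeholders, not part of the lecture notes.

```python
def finite_horizon_dp(states, controls, dist, f, g, g_N, N):
    """Backward DP recursion: returns cost tables J_k and a policy mu_k.
    `dist(k, x, u)` yields (w, prob) pairs; `f`, `g`, `g_N` are stage model functions."""
    J = {N: {x: g_N(x) for x in states}}
    mu = {}
    for k in reversed(range(N)):
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) } over the disturbance w
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in dist(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```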
PRACTICAL DIFFICULTIES OF DP
The curse of dimensionality
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
Intractability of imperfect state information problems
The curse of modeling
Sometimes a simulator of the system is easier to construct than a model
There may be real-time solution constraints
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled, creating a need for on-line replanning
All of the above are motivations for approximation and simulation
COST-TO-GO FUNCTION APPROXIMATION
Use a policy computed from the DP equation where the optimal cost-to-go function J_{k+1} is replaced by an approximation J̃_{k+1}.
Apply μ̄_k(x_k), which attains the minimum in
min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) }
Some approaches:
(a) Problem Approximation: Use J̃_k derived from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use as J̃_k a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this)
This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as J̃_k the cost of some suboptimal policy, which is calculated either analytically or by simulation
ROLLOUT ALGORITHMS
At each k and state x_k, use the control μ̄_k(x_k) that minimizes in
min_{u_k ∈ U_k(x_k)} E { g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) },
where J̃_{k+1} is the cost-to-go of some heuristic policy (called the base policy).
Cost improvement property: The rollout algorithm achieves no worse (and usually much better) cost than the base policy starting from the same state.
Main difficulty: Calculating J̃_{k+1}(x) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case.
Connection w/ Model Predictive Control (MPC)
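A minimal sketch of the rollout idea above for a generic finite-horizon model, with the base policy's cost-to-go estimated by Monte Carlo simulation. All names (dist_sample, base_policy, etc.) are hypothetical placeholders.

```python
def rollout_control(x, k, controls, dist_sample, f, g, base_policy, N, n_sims=100):
    """One-step lookahead using a Monte Carlo cost-to-go of a base (heuristic) policy."""
    def base_cost_to_go(xs0, t0):
        # Average cost of following the base policy from (xs0, t0) up to the horizon N.
        total = 0.0
        for _ in range(n_sims):
            xs, cost = xs0, 0.0
            for t in range(t0, N):
                u = base_policy(xs, t)
                w = dist_sample(t, xs, u)
                cost += g(t, xs, u, w)
                xs = f(t, xs, u, w)
            total += cost
        return total / n_sims

    best_u, best_val = None, float("inf")
    for u in controls(x):
        # Estimate E{ g_k(x,u,w) + J~_{k+1}(f_k(x,u,w)) } by sampling w.
        val = 0.0
        for _ in range(n_sims):
            w = dist_sample(k, x, u)
            val += g(k, x, u, w) + base_cost_to_go(f(k, x, u, w), k + 1)
        val /= n_sims
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```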
INFINITE HORIZON PROBLEMS
Same as the basic problem, but:
The number of stages is infinite.
The system is stationary.
Total cost problems: Minimize
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
Discounted problems (α < 1)
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g is bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
Boundedness of g guarantees that all costs are well-defined and bounded: |J_π(x)| ≤ M / (1 - α)
All spaces are arbitrary - only boundedness of g is important (there are math fine points, e.g. measurability, but they don't matter in practice)
Important special case: All underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP
All algorithms essentially work with an MDP that approximates the original problem
SHORTHAND NOTATION FOR DP MAPPING
For any function J of x
(TJ)(x) = min_{u ∈ U(x)} E_w { g(x, u, w) + α J(f(x, u, w)) },   ∀ x
TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost function αJ.
T operates on bounded functions of x to produce other bounded functions of x
For any stationary policy μ
(T_μ J)(x) = E_w { g(x, μ(x), w) + α J(f(x, μ(x), w)) },   ∀ x
The critical structure of the problem is captured in T and T_μ
The entire theory of discounted problems can be developed in shorthand using T and T_μ
This is true for many other DP problems
FINITE-HORIZON COST EXPRESSIONS
Consider an N-stage policy π_0^N = {μ_0, μ_1, . . . , μ_{N-1}} with a terminal cost J:
J_{π_0^N}(x_0) = E { α^N J(x_N) + Σ_{ℓ=0}^{N-1} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
             = E { g(x_0, μ_0(x_0), w_0) + α J_{π_1^N}(x_1) }
             = (T_{μ_0} J_{π_1^N})(x_0)
where π_1^N = {μ_1, μ_2, . . . , μ_{N-1}}
By induction we have
J_{π_0^N}(x) = (T_{μ_0} T_{μ_1} · · · T_{μ_{N-1}} J)(x),   ∀ x
For a stationary policy μ the N-stage cost function (with terminal cost J) is
J_{π_0^N} = T_μ^N J
where T_μ^N is the N-fold composition of T_μ
Similarly the optimal N-stage cost function (with terminal cost J) is T^N J
T^N J = T(T^{N-1} J) is just the DP algorithm
SHORTHAND THEORY A SUMMARY
Infinite horizon cost function expressions [with J_0(x) ≡ 0]
J_π(x) = lim_{N→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_N} J_0)(x),   J_μ(x) = lim_{N→∞} (T_μ^N J_0)(x)
Bellman's equation: J* = TJ*,   J_μ = T_μ J_μ
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k,
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Find μ^{k+1} such that
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
TWO KEY PROPERTIES
Monotonicity property: For any J and J' such that J(x) ≤ J'(x) for all x, and any μ
(TJ)(x) ≤ (TJ')(x),   ∀ x,
(T_μ J)(x) ≤ (T_μ J')(x),   ∀ x.
Constant Shift property: For any J, any scalar r, and any μ
(T(J + re))(x) = (TJ)(x) + αr,   ∀ x,
(T_μ(J + re))(x) = (T_μ J)(x) + αr,   ∀ x,
where e is the unit function [e(x) ≡ 1].
Monotonicity is present in all DP models (undiscounted, etc)
Constant shift is special to discounted models
Discounted problems have another property of major importance: T and T_μ are contraction mappings (we will show this later)
CONVERGENCE OF VALUE ITERATION
If J_0 ≡ 0,
J*(x) = lim_{k→∞} (T^k J_0)(x),   for all x
Proof: For any initial state x_0, and policy π = {μ_0, μ_1, . . .},
J_π(x_0) = E { Σ_{ℓ=0}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
         = E { Σ_{ℓ=0}^{k-1} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) } + E { Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) }
The tail portion satisfies
| E { Σ_{ℓ=k}^{∞} α^ℓ g(x_ℓ, μ_ℓ(x_ℓ), w_ℓ) } | ≤ α^k M / (1 - α),
where M ≥ |g(x, u, w)|. Take the min over π of both sides. Q.E.D.
BELLMANS EQUATION
The optimal cost function J* satisfies Bellman's Eq., i.e. J* = TJ*.
Proof: For all x and k,
J*(x) - α^k M / (1 - α) ≤ (T^k J_0)(x) ≤ J*(x) + α^k M / (1 - α),
where J_0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using Monotonicity and Constant Shift,
(TJ*)(x) - α^{k+1} M / (1 - α) ≤ (T^{k+1} J_0)(x) ≤ (TJ*)(x) + α^{k+1} M / (1 - α)
Taking the limit as k → ∞ and using the fact
lim_{k→∞} (T^{k+1} J_0)(x) = J*(x)
we obtain J* = TJ*. Q.E.D.
THE CONTRACTION PROPERTY
Contraction property: For any bounded functions J and J', and any μ,
max_x |(TJ)(x) - (TJ')(x)| ≤ α max_x |J(x) - J'(x)|,
max_x |(T_μ J)(x) - (T_μ J')(x)| ≤ α max_x |J(x) - J'(x)|.
Proof: Denote c = max_{x ∈ S} |J(x) - J'(x)|. Then
J(x) - c ≤ J'(x) ≤ J(x) + c,   ∀ x
Apply T to both sides, and use the Monotonicity and Constant Shift properties:
(TJ)(x) - αc ≤ (TJ')(x) ≤ (TJ)(x) + αc,   ∀ x
Hence
|(TJ)(x) - (TJ')(x)| ≤ αc,   ∀ x.  Q.E.D.
NEC. AND SUFFICIENT OPT. CONDITION
A stationary policy μ is optimal if and only if μ(x) attains the minimum in Bellman's equation for each x; i.e.,
TJ* = T_μ J*.
Proof: If TJ* = T_μ J*, then using Bellman's equation (J* = TJ*), we have
J* = T_μ J*,
so by uniqueness of the fixed point of T_μ, we obtain J* = J_μ; i.e., μ is optimal.
Conversely, if the stationary policy μ is optimal, we have J* = J_μ, so
J* = T_μ J*.
Combining this with Bellman's Eq. (J* = TJ*), we obtain TJ* = T_μ J*. Q.E.D.
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
Review of discounted problem theory
Review of shorthand notation
Algorithms for discounted DP
Value iteration
Policy iteration
Optimistic policy iteration
Q-factors and Q-learning
A more abstract view of DP
Extensions of discounted DP
Value and policy iteration
Asynchronous algorithms
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
SHORTHAND THEORY A SUMMARY
Cost function expressions [with J_0(x) ≡ 0]
J_π(x) = lim_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J_0)(x),   J_μ(x) = lim_{k→∞} (T_μ^k J_0)(x)
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ, or
J*(x) = min_{u ∈ U(x)} E_w { g(x, u, w) + α J*(f(x, u, w)) },   ∀ x
J_μ(x) = E_w { g(x, μ(x), w) + α J_μ(f(x, μ(x), w)) },   ∀ x
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(x) ∈ arg min_{u ∈ U(x)} E_w { g(x, u, w) + α J*(f(x, u, w)) },   ∀ x
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
MAJOR PROPERTIES
Monotonicity property: For any functions J and J' on the state space X such that J(x) ≤ J'(x) for all x ∈ X, and any μ
(TJ)(x) ≤ (TJ')(x),   (T_μ J)(x) ≤ (T_μ J')(x),   ∀ x ∈ X
Contraction property: For any bounded functions J and J', and any μ,
max_x |(TJ)(x) - (TJ')(x)| ≤ α max_x |J(x) - J'(x)|,
max_x |(T_μ J)(x) - (T_μ J')(x)| ≤ α max_x |J(x) - J'(x)|.
Compact Contraction Notation:
||TJ - TJ'|| ≤ α ||J - J'||,   ||T_μ J - T_μ J'|| ≤ α ||J - J'||,
where for any bounded function J, we denote by ||J|| the sup-norm
||J|| = max_{x ∈ X} |J(x)|.
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(x) = E_w { g(x, μ^k(x), w) + α J_{μ^k}(f(x, μ^k(x), w)) },   ∀ x
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(x) ∈ arg min_{u ∈ U(x)} E_w { g(x, u, w) + α J_{μ^k}(f(x, u, w)) },   ∀ x
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
For a finite state space, policy evaluation is equivalent to solving a linear system of equations
Dimension of the system is equal to the number of states.
For large problems, exact PI is out of the question (even though it terminates finitely)
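For a small finite-state MDP in the transition-probability notation used later in these notes (p_ij(u), g(i, u, j), discount α), VI and exact PI can be sketched as below; the input arrays P and G are hypothetical, not data from the notes.

```python
import numpy as np

def value_iteration(P, G, alpha, iters=1000, tol=1e-10):
    """VI: P[u] is the n x n transition matrix of control u, G[u][i][j] = g(i,u,j)."""
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(iters):
        # (TJ)(i) = min_u sum_j p_ij(u) ( g(i,u,j) + alpha J(j) )
        TJ = np.min([(P[u] * (G[u] + alpha * J)).sum(axis=1) for u in range(len(P))], axis=0)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ
    return J

def policy_iteration(P, G, alpha):
    """Exact PI: policy evaluation solves an n x n linear system, then improves."""
    n, m = P[0].shape[0], len(P)
    mu = np.zeros(n, dtype=int)
    while True:
        Pmu = np.array([P[mu[i]][i] for i in range(n)])
        gmu = np.array([(P[mu[i]][i] * G[mu[i]][i]).sum() for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)   # J_mu = T_mu J_mu
        Q = np.array([(P[u] * (G[u] + alpha * J)).sum(axis=1) for u in range(m)])
        new_mu = Q.argmin(axis=0)                            # policy improvement
        if np.array_equal(new_mu, mu):
            return J, mu
        mu = new_mu
```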
INTERPRETATION OF VI AND PI
[Figure: one-dimensional geometric interpretation of VI (the iterates TJ_0, T^2 J_0, . . . converge along the 45-degree line to the fixed point of J = TJ) and of PI (policy evaluation J_{μ1} = T_{μ1} J_{μ1}, followed by policy improvement)]
JUSTIFICATION OF POLICY ITERATION
We can show that J_{μ^{k+1}} ≤ J_{μ^k} for all k
Proof: For given k, we have
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k} ≤ T_{μ^k} J_{μ^k} = J_{μ^k}
Using the monotonicity property of DP,
J_{μ^k} ≥ T_{μ^{k+1}} J_{μ^k} ≥ T_{μ^{k+1}}^2 J_{μ^k} ≥ · · · ≥ lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k}
Since
lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k} = J_{μ^{k+1}}
we have J_{μ^k} ≥ J_{μ^{k+1}}.
If J_{μ^k} = J_{μ^{k+1}}, then J_{μ^k} solves Bellman's equation and is therefore equal to J*
So at iteration k either the algorithm generates a strictly improved policy or it finds an optimal policy
For a finite spaces MDP, there are finitely many stationary policies, so the algorithm terminates with an optimal policy
APPROXIMATE PI
Suppose that the policy evaluation is approximate,
||J_k - J_{μ^k}|| ≤ δ,   k = 0, 1, . . .
and policy improvement is approximate,
||T_{μ^{k+1}} J_k - T J_k|| ≤ ε,   k = 0, 1, . . .
where δ and ε are some positive scalars.
Error Bound I: The sequence {μ^k} generated by approximate policy iteration satisfies
limsup_{k→∞} ||J_{μ^k} - J*|| ≤ (ε + 2αδ) / (1 - α)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
Error Bound II: If in addition the sequence {μ^k} terminates at μ̄,
||J_{μ̄} - J*|| ≤ (ε + 2αδ) / (1 - α)
OPTIMISTIC POLICY ITERATION
Optimistic PI (more efficient): This is PI, where policy evaluation is done approximately, with a finite number of VI
So we approximate the policy evaluation
J_μ ≈ T_μ^m J
for some number m ∈ [1, ∞)
Shorthand definition: For some integers m_k
T_{μ^k} J_k = T J_k,   J_{k+1} = T_{μ^k}^{m_k} J_k,   k = 0, 1, . . .
If m_k ≡ 1 it becomes VI
If m_k = ∞ it becomes PI
Can be shown to converge (in an infinite number of iterations)
Q-LEARNING I
We can write Bellman's equation as
J*(x) = min_{u ∈ U(x)} Q*(x, u),   ∀ x,
where Q* is the unique solution of
Q*(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q*(x̄, v) }
with x̄ = f(x, u, w)
Q*(x, u) is called the optimal Q-factor of (x, u)
We can equivalently write the VI method as
J_{k+1}(x) = min_{u ∈ U(x)} Q_{k+1}(x, u),   ∀ x,
where Q_{k+1} is generated by
Q_{k+1}(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q_k(x̄, v) }
with x̄ = f(x, u, w)
Q-LEARNING II
Q-factors are no different than costs
They satisfy a Bellman equation Q = FQ where
(FQ)(x, u) = E { g(x, u, w) + α min_{v ∈ U(x̄)} Q(x̄, v) }
where x̄ = f(x, u, w)
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs
They require an equal amount of computation ... they just need more storage
Having optimal Q-factors is convenient when implementing an optimal policy on-line by
μ*(x) = arg min_{u ∈ U(x)} Q*(x, u)
Once Q*(x, u) are known, the model [g and E{·}] is not needed. Model-free operation.
Later we will see how stochastic/sampling methods can be used to calculate (approximations of) Q*(x, u) using a simulator of the system (no model needed)
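A minimal sketch of VI on Q-factors (Q_{k+1} = FQ_k) for a small finite MDP, with the greedy policy read off at the end; the model arrays P and G are hypothetical placeholders.

```python
import numpy as np

def q_factor_vi(P, G, alpha, iters=500, tol=1e-10):
    """Value iteration on Q-factors for an n-state MDP.
    P[u]: n x n transition matrix of control u, G[u][i][j] = g(i,u,j)."""
    n, m = P[0].shape[0], len(P)
    Q = np.zeros((n, m))
    for _ in range(iters):
        Jmin = Q.min(axis=1)                       # min_v Q(j, v)
        # (FQ)(i,u) = sum_j p_ij(u) ( g(i,u,j) + alpha * min_v Q(j,v) )
        Qnew = np.stack([(P[u] * (G[u] + alpha * Jmin)).sum(axis=1) for u in range(m)], axis=1)
        if np.max(np.abs(Qnew - Q)) < tol:
            Q = Qnew
            break
        Q = Qnew
    policy = Q.argmin(axis=1)                      # mu*(i) = arg min_u Q*(i, u)
    return Q, policy
```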
A MORE GENERAL/ABSTRACT VIEW
Let Y be a real vector space with a norm ||·||
A function F: Y → Y is said to be a contraction mapping if for some ρ ∈ (0, 1), we have
||Fy - Fz|| ≤ ρ ||y - z||,   for all y, z ∈ Y.
ρ is called the modulus of contraction of F.
Important example: Let X be a set (e.g., the state space in DP), v: X → ℝ be a positive-valued function. Let B(X) be the set of all functions J: X → ℝ such that J(x)/v(x) is bounded over x. We define a norm on B(X), called the weighted sup-norm, by
||J|| = max_{x ∈ X} |J(x)| / v(x).
Important special case: The discounted problem mappings T and T_μ [for v(x) ≡ 1, ρ = α].
A DP-LIKE CONTRACTION MAPPING
Let X = {1, 2, . . .}, and let F: B(X) → B(X) be a linear mapping of the form
(FJ)(i) = b_i + Σ_{j ∈ X} a_ij J(j),   i = 1, 2, . . .
where b_i and a_ij are some scalars. Then F is a contraction with modulus ρ if and only if
Σ_{j ∈ X} |a_ij| v(j) / v(i) ≤ ρ,   i = 1, 2, . . .
Let F: B(X) → B(X) be a mapping of the form
(FJ)(i) = min_{μ ∈ M} (F_μ J)(i),   i = 1, 2, . . .
where M is a parameter set, and for each μ ∈ M, F_μ is a contraction mapping from B(X) to B(X) with modulus ρ. Then F is a contraction mapping with modulus ρ.
Allows the extension of the main DP results from bounded cost to unbounded cost.
CONTRACTION MAPPING FIXED-POINT THEOREM
Contraction Mapping Fixed-Point Theorem: If F: B(X) → B(X) is a contraction with modulus ρ ∈ (0, 1), then there exists a unique J* ∈ B(X) such that
J* = FJ*.
Furthermore, if J is any function in B(X), then {F^k J} converges to J* and we have
||F^k J - J*|| ≤ ρ^k ||J - J*||,   k = 1, 2, . . . .
This is a special case of a general result for contraction mappings F: Y → Y over normed vector spaces Y that are complete: every sequence {y_k} that is Cauchy (satisfies ||y_m - y_n|| → 0 as m, n → ∞) converges.
The space B(X) is complete (see the text for a proof).
GENERAL FORMS OF DISCOUNTED DP
We consider an abstract form of DP based on monotonicity and contraction
Abstract Mapping: Denote R(X): set of real-valued functions J: X → ℝ, and let H: X × U × R(X) → ℝ be a given mapping. We consider the mapping
(TJ)(x) = min_{u ∈ U(x)} H(x, u, J),   x ∈ X.
We assume that (TJ)(x) > -∞ for all x ∈ X, so T maps R(X) into R(X).
Abstract Policies: Let M be the set of "policies", i.e., functions μ such that μ(x) ∈ U(x) for all x ∈ X. For each μ ∈ M, we consider the mapping T_μ: R(X) → R(X) defined by
(T_μ J)(x) = H(x, μ(x), J),   x ∈ X.
Find a function J* ∈ R(X) such that
J*(x) = min_{u ∈ U(x)} H(x, u, J*),   x ∈ X
EXAMPLES
Discounted problems (and stochastic shortest paths - SSP for α = 1)
H(x, u, J) = E { g(x, u, w) + α J(f(x, u, w)) }
Discounted Semi-Markov Problems
H(x, u, J) = G(x, u) + Σ_{y=1}^{n} m_xy(u) J(y)
where m_xy are "discounted" transition probabilities, defined by the transition distributions
Shortest Path Problems
H(x, u, J) = a_xu + J(u) if u ≠ d,   H(x, u, J) = a_xd if u = d
where d is the destination. There is also a stochastic version of this problem.
Minimax Problems
H(x, u, J) = max_{w ∈ W(x,u)} [ g(x, u, w) + α J(f(x, u, w)) ]
ASSUMPTIONS
Monotonicity assumption: If J, J' ∈ R(X) and J ≤ J', then
H(x, u, J) ≤ H(x, u, J'),   ∀ x ∈ X, u ∈ U(x)
Contraction assumption:
For every J ∈ B(X), the functions T_μ J and TJ belong to B(X).
For some α ∈ (0, 1), and all μ and J, J' ∈ B(X), we have
||T_μ J - T_μ J'|| ≤ α ||J - J'||
We can show all the standard analytical and computational results of discounted DP based on these two assumptions
With just the monotonicity assumption (as in the SSP or other undiscounted problems) we can still show various forms of the basic results under appropriate assumptions
RESULTS USING CONTRACTION
Proposition 1: The mappings T_μ and T are weighted sup-norm contraction mappings with modulus α over B(X), and have unique fixed points in B(X), denoted J_μ and J*, respectively (cf. Bellman's equation).
Proof: From the contraction property of H.
Proposition 2: For any J ∈ B(X) and μ ∈ M,
lim_{k→∞} T_μ^k J = J_μ,   lim_{k→∞} T^k J = J*
(cf. convergence of value iteration).
Proof: From the contraction property of T_μ and T.
Proposition 3: We have T_μ J* = TJ* if and only if J_μ = J* (cf. optimality condition).
Proof: If T_μ J* = TJ*, then T_μ J* = J*, implying J* = J_μ. Conversely, if J_μ = J*, then T_μ J* = T_μ J_μ = J_μ = J* = TJ*.
RESULTS USING MON. AND CONTRACTION
Optimality of fixed point:
J*(x) = min_{μ ∈ M} J_μ(x),   x ∈ X
Furthermore, for every ε > 0, there exists μ_ε ∈ M such that
J*(x) ≤ J_{μ_ε}(x) ≤ J*(x) + ε,   x ∈ X
Nonstationary policies: Consider the set of all sequences π = {μ_0, μ_1, . . .} with μ_k ∈ M for all k, and define
J_π(x) = liminf_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J)(x),   x ∈ X,
with J being any function (the choice of J does not matter)
We have
J*(x) = min_π J_π(x),   x ∈ X
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) J
J*(x) = lim_{k→∞} (T^k J)(x),   ∀ x
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Find μ^{k+1} such that
T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Optimistic PI: This is PI, where policy evaluation is carried out by a finite number of VI
Shorthand definition: For some integers m_k
T_{μ^k} J_k = T J_k,   J_{k+1} = T_{μ^k}^{m_k} J_k,   k = 0, 1, . . .
If m_k ≡ 1 it becomes VI
If m_k = ∞ it becomes PI
For intermediate values of m_k, it is generally more efficient than either VI or PI
ASYNCHRONOUS ALGORITHMS
Motivation for asynchronous algorithms
Faster convergence
Parallel and distributed computation
Simulation-based implementations
General framework: Partition X into disjoint nonempty subsets X_1, . . . , X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ
Let J be partitioned as
J = (J_1, . . . , J_m),
where J_ℓ is the restriction of J on the set X_ℓ.
Synchronous algorithm:
J_ℓ^{t+1}(x) = T(J_1^t, . . . , J_m^t)(x),   x ∈ X_ℓ, ℓ = 1, . . . , m
Asynchronous algorithm: For some subsets of times R_ℓ,
J_ℓ^{t+1}(x) = T(J_1^{τ_{ℓ1}(t)}, . . . , J_m^{τ_{ℓm}(t)})(x)   if t ∈ R_ℓ,
J_ℓ^{t+1}(x) = J_ℓ^t(x)   if t ∉ R_ℓ
where t - τ_{ℓj}(t) are communication "delays"
ONE-STATE-AT-A-TIME ITERATIONS
Important special case: Assume n "states", a separate processor for each state, and no delays
Generate a sequence of states {x^0, x^1, . . .}, generated in some way, possibly by simulation (each state is generated infinitely often)
Asynchronous VI:
J_ℓ^{t+1} = T(J_1^t, . . . , J_n^t)(ℓ)   if ℓ = x^t,
J_ℓ^{t+1} = J_ℓ^t   if ℓ ≠ x^t,
where T(J_1^t, . . . , J_n^t)(ℓ) denotes the ℓ-th component of the vector
T(J_1^t, . . . , J_n^t) = TJ^t,
and for simplicity we write J_ℓ^t instead of J_ℓ^t(ℓ)
The special case where
{x^0, x^1, . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}
is the Gauss-Seidel method
We can show that J^t → J* under the contraction assumption
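A sketch of the Gauss-Seidel special case described above (states swept cyclically, each update immediately using the latest values of the other states); the model arrays P and G are again hypothetical.

```python
import numpy as np

def gauss_seidel_vi(P, G, alpha, sweeps=200):
    """One-state-at-a-time (Gauss-Seidel) value iteration for a finite MDP."""
    n, m = P[0].shape[0], len(P)
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):   # states visited in the cyclic order 1, ..., n, 1, ...
            # in-place update: later states in this sweep see the new J[i]
            J[i] = min((P[u][i] * (G[u][i] + alpha * J)).sum() for u in range(m))
    return J
```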
ASYNCHRONOUS CONV. THEOREM I
Assume that for all ℓ, j = 1, . . . , m, R_ℓ is infinite and lim_{t→∞} τ_{ℓj}(t) = ∞
Proposition: Let T have a unique fixed point J*, and assume that there is a sequence of nonempty subsets {S(k)} ⊂ R(X) with S(k+1) ⊂ S(k) for all k, and with the following properties:
(1) Synchronous Convergence Condition: Every sequence {J^k} with J^k ∈ S(k) for each k converges pointwise to J*. Moreover, we have
TJ ∈ S(k+1),   ∀ J ∈ S(k), k = 0, 1, . . . .
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S_1(k) × · · · × S_m(k),
where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, . . . , m.
Then for every J ∈ S(0), the sequence {J^t} generated by the asynchronous algorithm converges pointwise to J*.
ASYNCHRONOUS CONV. THEOREM II
Interpretation of assumptions:
[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k+1) containing J*, with J = (J_1, J_2) and the box S_1(0) × S_2(0); the synchronous iterate TJ lands in S(k+1)]
A synchronous iteration from any J in S(k) moves into S(k+1) (component-by-component)
Convergence mechanism:
[Figure: J_1 iterations and J_2 iterations moving within the nested boxes toward J*]
Key: Independent component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k+1)
APPROXIMATE DYNAMIC PROGRAMMING
LECTURE 3
LECTURE OUTLINE
Review of theory and algorithms for discounted DP
MDP and stochastic shortest path problems (briefly)
Introduction to approximation in policy and value space
Approximation architectures
Simulation-based approximate policy iteration
Approximate policy iteration and Q-factors
Direct and indirect approximation
Simulation issues
DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space
x_{k+1} = f(x_k, u_k, w_k),   k = 0, 1, . . .
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }
with α < 1, and g bounded [for some M, we have |g(x, u, w)| ≤ M for all (x, u, w)]
MDP - TRANSITION PROBABILITY NOTATION
Assume the system is an n-state (controlled) Markov chain
Change to Markov chain notation
States i = 1, . . . , n (instead of x)
Transition probabilities p_{i_k i_{k+1}}(u_k) [instead of x_{k+1} = f(x_k, u_k, w_k)]
Stage cost g(i_k, u_k, i_{k+1}) [instead of g(x_k, u_k, w_k)]
Cost of a policy π = {μ_0, μ_1, . . .}
J_π(i) = lim_{N→∞} E_{i_k, k=1,2,...} { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
SHORTHAND THEORY A SUMMARY
Cost function expressions [with J_0(i) ≡ 0]
J_π(i) = lim_{k→∞} (T_{μ_0} T_{μ_1} · · · T_{μ_k} J_0)(i),   J_μ(i) = lim_{k→∞} (T_μ^k J_0)(i)
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ or
J*(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
J_μ(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J_μ(j) ),   ∀ i
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ ℝ^n
J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, . . . , n
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(i) = Σ_{j=1}^{n} p_ij(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),   i = 1, . . . , n
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_{μ^k}(j) ),   ∀ i
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Policy evaluation is equivalent to solving an n × n linear system of equations
For large n, exact PI is out of the question (even though it terminates finitely)
STOCHASTIC SHORTEST PATH (SSP) PROBLEMS
Involves states i = 1, . . . , n plus a special cost-free and absorbing termination state t
Objective: Minimize the total (undiscounted) cost. Aim: Reach t at minimum expected cost
An example: Tetris
[Figure: Tetris board and falling piece]
SSP THEORY
SSP problems provide a "soft boundary" between the easy finite-state discounted problems and the hard undiscounted problems.
They share features of both.
Some of the nice theory is recovered because of the termination state.
Definition: A proper policy is a stationary policy that leads to t with probability 1
If all stationary policies are proper, T and T_μ are contractions with respect to a common weighted sup-norm
The entire analytical and algorithmic theory for discounted problems goes through if all stationary policies are proper (we will assume this)
There is a strong theory even if there are improper policies (but they should be assumed to be nonoptimal - see the textbook)
GENERAL ORIENTATION TO ADP
We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n).
Extensions to SSP and average cost are possible (but more quirky). We will set them aside for later.
There are many approaches:
Manual/trial-and-error approach
Problem approximation
Simulation-based approaches (we will focus on these): "neuro-dynamic programming" or "reinforcement learning".
Simulation is essential for large state spaces because of its (potential) computational complexity advantage in computing sums/expectations involving a very large number of terms.
Simulation also comes in handy when an analytical model of the system is unavailable, but a simulation/computer model is possible.
Simulation-based methods are of three types:
Rollout (we will not discuss further)
Approximation in value space
Approximation in policy space
APPROXIMATION IN VALUE SPACE
Approximate J* or J_μ from a parametric class J̃(i, r) where i is the current state and r = (r_1, . . . , r_m) is a vector of "tunable" scalars ("weights").
By adjusting r we can change the "shape" of J̃ so that it is reasonably close to the true optimal J*.
Two key issues:
The choice of parametric class J̃(i, r) (the approximation architecture).
Method for tuning the weights ("training" the architecture).
Successful application strongly depends on how these issues are handled, and on insight about the problem.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model).
We will focus on simulation, but this is not the only possibility [e.g., J̃(i, r) may be a lower bound approximation based on relaxation, or other problem approximation]
APPROXIMATION ARCHITECTURES
Divided in linear and nonlinear [i.e., linear or nonlinear dependence of J̃(i, r) on r].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
Computer chess example: Uses a feature-based position evaluator that assigns a score to each move/position.
[Figure: position evaluator: feature extraction (material balance, mobility, safety, etc.) followed by a weighting of features, producing a score]
Many context-dependent special features.
Most often the weighting of features is linear but multistep lookahead is involved.
In chess, most often the training is done by trial and error.
LINEAR APPROXIMATION ARCHITECTURE
Ideally, the features encode much of the nonlinearity inherent in the cost-to-go approximated
Then the approximation may be quite accurate without a complicated architecture.
With well-chosen features, we can use a linear architecture: J̃(i, r) = φ(i)'r, i = 1, . . . , n, or more compactly J̃(r) = Φr
Φ: the matrix whose rows are φ(i)', i = 1, . . . , n
[Figure: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)'r]
This is approximation on the subspace
S = {Φr | r ∈ ℝ^s}
spanned by the columns of Φ (basis functions)
Many examples of feature types: Polynomial approximation, radial basis functions, kernels of all sorts, interpolation, and special problem-specific (as in chess and tetris)
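A small sketch of a linear architecture J̃(i, r) = φ(i)'r with hand-picked polynomial features; the feature choice, the number of states, and the weight vector are purely illustrative.

```python
import numpy as np

def features(i, n):
    """phi(i): a constant, a linear, and a quadratic term in the scaled state index."""
    x = i / n
    return np.array([1.0, x, x ** 2])

n = 1000                                                   # hypothetical (large) number of states
Phi = np.array([features(i, n) for i in range(1, n + 1)])  # n x s feature matrix
r = np.array([0.5, -1.0, 2.0])                             # tunable weight vector
J_tilde = Phi @ r                                          # approximation on the subspace S = {Phi r}
```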
APPROXIMATION IN POLICY SPACE
A brief discussion; we will return to it at the end.
We parameterize the set of policies by a vector r = (r_1, . . . , r_s) and we optimize the cost over r
Discounted problem example:
Each value of r defines a stationary policy, with cost starting at state i denoted by J̃(i; r).
Use a random search, gradient, or other method to minimize over r
J̄(r) = Σ_{i=1}^{n} p_i J̃(i; r),
where (p_1, . . . , p_n) is some probability distribution over the states.
In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.
A cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman's equation.
APPROX. IN VALUE SPACE - APPROACHES
Approximate PI (Policy evaluation/Policy improvement)
Uses simulation algorithms to approximate the cost J_μ of the current policy μ
Projected equation and aggregation approaches
Approximation of the optimal cost function J*
Q-Learning: Use a simulation algorithm to approximate the optimal costs J*(i) or the Q-factors
Q*(i, u) = g(i, u) + α Σ_{j=1}^{n} p_ij(u) J*(j)
Bellman error approach: Find r to
min_r E_i { ( J̃(i, r) - (T J̃)(i, r) )^2 }
where E_i{·} is taken with respect to some distribution
Approximate LP (we will not discuss here)
APPROXIMATE POLICY ITERATION
General structure
[Figure: system simulator, decision generator producing decisions μ(i), cost-to-go approximator supplying values J̃(j, r), and cost approximation algorithm fed by samples]
J̃(j, r) is the cost approximation for the preceding policy, used by the decision generator to compute the current policy μ [whose cost is approximated by J̃(j, r) using simulation]
There are several cost approximation/policy evaluation algorithms
There are several important issues relating to the design of each block (to be discussed in the future).
POLICY EVALUATION APPROACHES I
Direct policy evaluation
Approximate the cost of the current policy by using least squares and simulation-generated cost samples
Amounts to projection of J_μ onto the approximation subspace
[Figure: direct method: projection ΠJ_μ of the cost vector J_μ onto the subspace S = {Φr | r ∈ ℝ^s}]
Solution of the least squares problem by batch and incremental methods
Regular and optimistic policy iteration
Nonlinear approximation architectures may also be used
POLICY EVALUATION APPROACHES II
Indirect policy evaluation
[Figure: Projected Value Iteration (PVI): the value iterate T_μ(Φr^k) = g + αPΦr^k is projected on S, the subspace spanned by the basis functions, to give Φr^{k+1}; Least Squares Policy Evaluation (LSPE): the same projection plus simulation error]
An example of indirect approach: Galerkin approximation
Solve the projected equation Φr = ΠT_μ(Φr) where Π is projection w/ respect to a suitable weighted Euclidean norm
TD(λ): Stochastic iterative algorithm for solving Φr = ΠT_μ(Φr)
LSPE(λ): A simulation-based form of projected value iteration
Φr^{k+1} = ΠT_μ(Φr^k) + simulation noise
LSTD(λ): Solves a simulation-based approximation w/ a standard solver (Matlab)
POLICY EVALUATION APPROACHES III
Aggregation approximation: Solve
Φr = ΦDT_μ(Φr)
where the rows of D and Φ are prob. distributions (e.g., D and Φ "aggregate" rows and columns of the linear system J = T_μ J).
[Figure: original system states i, j with transition probabilities p_ij(u); aggregate states x, y with disaggregation probabilities d_xi and aggregation probabilities φ_jy; aggregate transition probabilities p̂_xy(u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) φ_jy and aggregate costs ĝ(x, u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) g(i, u, j)]
Several different choices of D and Φ.
POLICY EVALUATION APPROACHES IV
[Figure: aggregation/disaggregation diagram as on the previous slide, with disaggregation probabilities d_xi and aggregation probabilities φ_jy]
Aggregation is a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Because an exact PI algorithm is used to solve the approximate/aggregate problem, the method behaves more regularly than the projected equation approach
THEORETICAL BASIS OF APPROXIMATE PI
If policies are approximately evaluated using an approximation architecture such that
max_i |J̃(i, r_k) - J_{μ^k}(i)| ≤ δ,   k = 0, 1, . . .
If policy improvement is also approximate,
max_i |(T_{μ^{k+1}} J̃)(i, r_k) - (T J̃)(i, r_k)| ≤ ε,   k = 0, 1, . . .
Error bound: The sequence {μ^k} generated by approximate policy iteration satisfies
limsup_{k→∞} max_i ( J_{μ^k}(i) - J*(i) ) ≤ (ε + 2αδ) / (1 - α)^2
Typical practical behavior: The method makes steady progress up to a point and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
THE USE OF SIMULATION - AN EXAMPLE
Projection by Monte Carlo Simulation: Compute the projection ΠJ of a vector J ∈ ℝ^n on subspace S = {Φr | r ∈ ℝ^s}, with respect to a weighted Euclidean norm ||·||_ξ.
Equivalently, find Φr*, where
r* = arg min_{r ∈ ℝ^s} ||Φr - J||_ξ^2 = arg min_{r ∈ ℝ^s} Σ_{i=1}^{n} ξ_i ( φ(i)'r - J(i) )^2
Setting to 0 the gradient at r*,
r* = ( Σ_{i=1}^{n} ξ_i φ(i) φ(i)' )^{-1} Σ_{i=1}^{n} ξ_i φ(i) J(i)
Approximate by simulation the two "expected values":
r̂_k = ( Σ_{t=1}^{k} φ(i_t) φ(i_t)' )^{-1} Σ_{t=1}^{k} φ(i_t) J(i_t)
Equivalent least squares alternative:
r̂_k = arg min_{r ∈ ℝ^s} Σ_{t=1}^{k} ( φ(i_t)'r - J(i_t) )^2
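The simulation-based projection above can be sketched as follows; sample_state, cost_sample, and features are hypothetical stand-ins for the state-sampling mechanism (with weights ξ_i), the cost samples J(i_t), and the rows of Φ.

```python
import numpy as np

def mc_projection(sample_state, cost_sample, features, s, k):
    """Monte Carlo projection: solve min_r sum_t ( phi(i_t)' r - J(i_t) )^2."""
    A = np.zeros((s, s))
    b = np.zeros(s)
    for _ in range(k):
        i = sample_state()               # i_t drawn according to the weights xi_i
        phi = features(i)
        A += np.outer(phi, phi)          # accumulates sum_t phi(i_t) phi(i_t)'
        b += phi * cost_sample(i)        # accumulates sum_t phi(i_t) J(i_t)
    return np.linalg.solve(A, b)         # r_hat_k
```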
THE ISSUE OF EXPLORATION
To evaluate a policy μ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under μ.
As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.
This seriously impacts the improved policy.
This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
One possibility for adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.
Another possibility: Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy μ.
Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).
APPROXIMATING Q-FACTORS
The approach described so far for policy evaluation requires calculating expected values [and knowledge of p_ij(u)] for all controls u ∈ U(i).
Model-free alternative: Approximate Q-factors
Q̃(i, u, r) ≈ Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_μ(j) )
and use for policy improvement the minimization
μ̄(i) = arg min_{u ∈ U(i)} Q̃(i, u, r)
r is an adjustable parameter vector and Q̃(i, u, r) is a parametric architecture, such as
Q̃(i, u, r) = Σ_{m=1}^{s} r_m φ_m(i, u)
We can use any approach for cost approximation, e.g., projected equations, aggregation.
Use the Markov chain with states (i, u): p_ij(μ(i)) is the transition prob. to (j, μ(i)), 0 to other (j, u).
Major concern: Acutely diminished exploration.
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
Review of approximation in value space
Approximate VI and PI
Projected Bellman equations
Matrix form of the projected equation
Simulation-based implementation
LSTD and LSPE methods
Optimistic versions
Multistep projected Bellman equations
Bias-variance tradeoff
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, . . . , n and finite set of controls u ∈ U(i)
Transition probabilities: p_ij(u)
[Figure: two-state Markov chain with transition probabilities p_11(u), p_12(u), p_21(u), p_22(u)]
Cost of a policy π = {μ_0, μ_1, . . .} starting at state i:
J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
with α ∈ [0, 1)
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
SHORTHAND THEORY A SUMMARY
Bellman's equation: J* = TJ*, J_μ = T_μ J_μ or
J*(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
J_μ(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J_μ(j) ),   ∀ i
Optimality condition:
μ: optimal  ⇔  T_μ J* = TJ*
i.e.,
μ(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ),   ∀ i
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ ℝ^n
J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, . . . , n
Policy iteration: Given μ^k
Policy evaluation: Find J_{μ^k} by solving
J_{μ^k}(i) = Σ_{j=1}^{n} p_ij(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),   i = 1, . . . , n
or J_{μ^k} = T_{μ^k} J_{μ^k}
Policy improvement: Let μ^{k+1} be such that
μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_{μ^k}(j) ),   ∀ i
or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Policy evaluation is equivalent to solving an n × n linear system of equations
For large n, exact PI is out of the question (even though it terminates finitely)
APPROXIMATION IN VALUE SPACE
Approximate J* or J_μ from a parametric class J̃(i, r), where i is the current state and r = (r_1, . . . , r_m) is a vector of "tunable" scalars ("weights").
By adjusting r we can change the "shape" of J̃ so that it is close to the true optimal J*.
Any r ∈ ℝ^s defines a (suboptimal) one-step lookahead policy
μ̃(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) ),   ∀ i
We will focus mostly on linear architectures
J̃(r) = Φr
where Φ is an n × s matrix whose columns are viewed as basis functions
Think n: HUGE, s: (relatively) SMALL
For J̃(r) = Φr, approximation in value space means approximation of J* or J_μ within the subspace
S = {Φr | r ∈ ℝ^s}
APPROXIMATE VI
Approximates sequentially J_k(i) = (T^k J_0)(i), k = 1, 2, . . ., with J̃_k(i, r_k)
The starting function J_0 is given (e.g., J_0 ≡ 0)
After a large enough number N of steps, J̃_N(i, r_N) is used as approximation J̃(i, r) to J*(i)
Fitted Value Iteration: A sequential "fit" to produce J̃_{k+1} from J̃_k, i.e., J̃_{k+1} ≈ T J̃_k or (for a single policy μ) J̃_{k+1} ≈ T_μ J̃_k
For a small subset S_k of states i, compute
(T J̃_k)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃_k(j, r) )
Fit the function J̃_{k+1}(i, r_{k+1}) to the "small" set of values (T J̃_k)(i), i ∈ S_k
Simulation can be used for "model-free" implementation
Error Bound: If the fit is uniformly accurate within δ > 0 (i.e., max_i |J̃_{k+1}(i) - T J̃_k(i)| ≤ δ),
limsup_{k→∞} max_{i=1,...,n} ( J̃_k(i, r_k) - J*(i) ) ≤ 2αδ / (1 - α)^2
AN EXAMPLE OF FAILURE
Consider a two-state discounted MDP with states 1 and 2, and a single policy.
Deterministic transitions: 1 → 2 and 2 → 2
Transition costs ≡ 0, so J*(1) = J*(2) = 0.
Consider an approximate VI scheme that approximates cost functions in S = { (r, 2r) | r ∈ ℝ } with a weighted least squares fit; here Φ = (1, 2)'
Given J_k = (r_k, 2r_k), we find J_{k+1} = (r_{k+1}, 2r_{k+1}), where for weights ξ_1, ξ_2 > 0, r_{k+1} is obtained as
r_{k+1} = arg min_r [ ξ_1 ( r - (T J_k)(1) )^2 + ξ_2 ( 2r - (T J_k)(2) )^2 ]
With straightforward calculation
r_{k+1} = αβ r_k,   where β = 2(ξ_1 + 2ξ_2)/(ξ_1 + 4ξ_2) > 1
So if α > 1/β, the sequence {r_k} diverges and so does {J_k}.
Difficulty is that T is a contraction, but ΠT (= least squares fit composed with T) is not
Norm mismatch problem
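A quick numerical check of the divergence mechanism above, assuming equal weights ξ_1 = ξ_2 = 1 and α = 0.95 (so that αβ > 1); the numbers are illustrative only.

```python
# Two-state example: costs 0, both states move to state 2, J_k = (r_k, 2 r_k).
alpha, xi1, xi2 = 0.95, 1.0, 1.0
beta = 2 * (xi1 + 2 * xi2) / (xi1 + 4 * xi2)   # = 1.2 here, so alpha * beta = 1.14 > 1
r = 1.0
for k in range(10):
    # (T J_k)(1) = (T J_k)(2) = alpha * 2 * r; the weighted least squares fit gives:
    r = alpha * beta * r
    print(k, r)   # grows geometrically: the fitted iterates diverge
```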
APPROXIMATE PI
[Figure: approximate PI loop: guess initial policy → evaluate approximate cost J̃_μ(r) = Φr using simulation → generate "improved" policy μ̄ → repeat (approximate policy evaluation / policy improvement)]
Evaluation of typical policy μ: Linear cost function approximation J̃_μ(r) = Φr, where Φ is a full rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
Policy "improvement" to generate μ̄:
μ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α φ(j)'r )
Error Bound: If
max_i |J̃_{μ^k}(i, r_k) - J_{μ^k}(i)| ≤ δ,   k = 0, 1, . . .
The sequence {μ^k} satisfies
limsup_{k→∞} max_i ( J_{μ^k}(i) - J*(i) ) ≤ 2αδ / (1 - α)^2
POLICY EVALUATION
Let's consider approximate evaluation of the cost of the current policy by using simulation.
Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares
Indirect policy evaluation - solving the projected equation Φr = ΠT_μ(Φr) where Π is projection w/ respect to a suitable weighted Euclidean norm
[Figure: Direct method: projection ΠJ_μ of the cost vector J_μ on the subspace S spanned by the basis functions. Indirect method: solving a projected form of Bellman's equation, Φr = ΠT_μ(Φr)]
Recall that projection can be implemented by simulation and least squares
WEIGHTED EUCLIDEAN PROJECTIONS
Consider a weighted Euclidean norm
||J||_ξ = ( Σ_{i=1}^{n} ξ_i ( J(i) )^2 )^{1/2},
where ξ is a vector of positive weights ξ_1, . . . , ξ_n.
Let Π denote the projection operation onto
S = {Φr | r ∈ ℝ^s}
with respect to this norm, i.e., for any J ∈ ℝ^n,
ΠJ = Φr*,   where r* = arg min_{r ∈ ℝ^s} ||J - Φr||_ξ^2
PI WITH INDIRECT POLICY EVALUATION
[Figure: approximate PI loop, as before: guess initial policy, evaluate approximate cost J̃_μ(r) = Φr using simulation, generate "improved" policy]
Given the current policy μ:
We solve the projected Bellman's equation
Φr = ΠT_μ(Φr)
We approximate the solution J_μ of Bellman's equation
J_μ = T_μ J_μ
with the projected equation solution J̃_μ(r)
KEY QUESTIONS AND RESULTS
Does the projected equation have a solution?
Under what conditions is the mapping ΠT_μ a contraction, so ΠT_μ has a unique fixed point?
Assuming ΠT_μ has a unique fixed point Φr*, how close is Φr* to J_μ?
Assumption: The Markov chain corresponding to μ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive
ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^{N} P(i_k = j | i_0 = i) > 0
Proposition: (Norm Matching Property)
(a) ΠT_μ is a contraction of modulus α with respect to the weighted Euclidean norm ||·||_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector.
(b) The unique fixed point Φr* of ΠT_μ satisfies
||J_μ - Φr*||_ξ ≤ (1 / √(1 - α^2)) ||J_μ - ΠJ_μ||_ξ
PRELIMINARIES: PROJECTION PROPERTIES
Important property of the projection Π on S with weighted Euclidean norm ||·||_ξ. For all J ∈ ℝ^n, J̄ ∈ S, the Pythagorean Theorem holds:
||J - J̄||_ξ^2 = ||J - ΠJ||_ξ^2 + ||ΠJ - J̄||_ξ^2
Proof: Geometrically, (J - ΠJ) and (ΠJ - J̄) are orthogonal in the scaled geometry of the norm ||·||_ξ, where two vectors x, y ∈ ℝ^n are orthogonal if Σ_{i=1}^{n} ξ_i x_i y_i = 0. Expand the quadratic in the RHS below:
||J - J̄||_ξ^2 = ||(J - ΠJ) + (ΠJ - J̄)||_ξ^2
The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
||ΠJ - ΠJ̄||_ξ ≤ ||J - J̄||_ξ,   for all J, J̄ ∈ ℝ^n.
To see this, note that
||Π(J - J̄)||_ξ^2 ≤ ||Π(J - J̄)||_ξ^2 + ||(I - Π)(J - J̄)||_ξ^2 = ||J - J̄||_ξ^2
PROOF OF CONTRACTION PROPERTY
Lemma: If P is the transition matrix of μ,
||Pz||_ξ ≤ ||z||_ξ,   z ∈ ℝ^n
Proof: Let p_ij be the components of P. For all z ∈ ℝ^n, we have
||Pz||_ξ^2 = Σ_{i=1}^{n} ξ_i ( Σ_{j=1}^{n} p_ij z_j )^2 ≤ Σ_{i=1}^{n} ξ_i Σ_{j=1}^{n} p_ij z_j^2
          = Σ_{j=1}^{n} Σ_{i=1}^{n} ξ_i p_ij z_j^2 = Σ_{j=1}^{n} ξ_j z_j^2 = ||z||_ξ^2,
where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^{n} ξ_i p_ij = ξ_j of the steady-state probabilities.
Using the lemma, the nonexpansiveness of Π, and the definition T_μ J = g + αPJ, we have
||ΠT_μ J - ΠT_μ J̄||_ξ ≤ ||T_μ J - T_μ J̄||_ξ = α ||P(J - J̄)||_ξ ≤ α ||J - J̄||_ξ
for all J, J̄ ∈ ℝ^n. Hence ΠT_μ is a contraction of modulus α.
PROOF OF ERROR BOUND
Let Φr* be the fixed point of ΠT. We have
||J_μ - Φr*||_ξ ≤ (1 / √(1 - α^2)) ||J_μ - ΠJ_μ||_ξ.
Proof: We have
||J_μ - Φr*||_ξ^2 = ||J_μ - ΠJ_μ||_ξ^2 + ||ΠJ_μ - Φr*||_ξ^2
                 = ||J_μ - ΠJ_μ||_ξ^2 + ||ΠT J_μ - ΠT(Φr*)||_ξ^2
                 ≤ ||J_μ - ΠJ_μ||_ξ^2 + α^2 ||J_μ - Φr*||_ξ^2,
where
The first equality uses the Pythagorean Theorem
The second equality holds because J_μ is the fixed point of T and Φr* is the fixed point of ΠT
The inequality uses the contraction property of ΠT.
Q.E.D.
MATRIX FORM OF PROJECTED EQUATION
Its solution is the vector J̃ = Φr*, where r* solves the problem
min_{r ∈ ℝ^s} ||Φr - (g + αPΦr*)||_ξ^2.
Setting to 0 the gradient with respect to r of this quadratic, we obtain
Φ'Ξ ( Φr* - (g + αPΦr*) ) = 0,
where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, . . . , ξ_n along the diagonal.
This is just the orthogonality condition: The error Φr* - (g + αPΦr*) is orthogonal to the subspace spanned by the columns of Φ.
Equivalently,
Cr* = d,
where
C = Φ'Ξ(I - αP)Φ,   d = Φ'Ξg.
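For a small problem where P, g, Φ, and ξ are available explicitly, the matrix form Cr* = d can be solved directly, as in the sketch below; in practice C and d are only estimated by simulation (LSTD), and all inputs here are hypothetical.

```python
import numpy as np

def projected_equation_solution(Phi, P, g, xi, alpha):
    """Exact solution of the projected Bellman equation: C r* = d, r* = C^{-1} d."""
    Xi = np.diag(xi)                                          # diag(xi_1, ..., xi_n)
    C = Phi.T @ Xi @ (np.eye(len(xi)) - alpha * P) @ Phi      # C = Phi' Xi (I - alpha P) Phi
    d = Phi.T @ Xi @ g                                        # d = Phi' Xi g
    return np.linalg.solve(C, d)
```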
PROJECTED EQUATION: SOLUTION METHOD
Matrix inversion: r* = C^{-1} d
Projected Value Iteration (PVI) method:
Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)
Converges to r* because ΠT is a contraction.
[Figure: PVI iteration: the value iterate T(Φr_k) = g + αPΦr_k is projected on S, the subspace spanned by the basis functions, to obtain Φr_{k+1}]
PVI can be written as:
r_{k+1} = arg min_{r ∈ ℝ^s} ||Φr - (g + αPΦr_k)||_ξ^2
By setting to 0 the gradient with respect to r,
Φ'Ξ ( Φr_{k+1} - (g + αPΦr_k) ) = 0,
which yields
r_{k+1} = r_k - (Φ'ΞΦ)^{-1} (Cr_k - d)
SIMULATION-BASED IMPLEMENTATIONS
Key idea: Calculate simulation-based approximations based on k samples
C_k ≈ C,   d_k ≈ d
Matrix inversion r* = C^{-1} d is approximated by
r̂_k = C_k^{-1} d_k
This is the LSTD (Least Squares Temporal Differences) Method.
PVI method r_{k+1} = r_k - (Φ'ΞΦ)^{-1} (Cr_k - d) is approximated by
r_{k+1} = r_k - G_k (C_k r_k - d_k)
where G_k ≈ (Φ'ΞΦ)^{-1}
This is the LSPE (Least Squares Policy Evaluation) Method.
Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).
SIMULATION MECHANICS
We generate an infinitely long trajectory (i_0, i_1, . . .) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.
After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).
We form
C_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) ( φ(i_t) - α φ(i_{t+1}) )' ≈ Φ'Ξ(I - αP)Φ
d_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) g(i_t, i_{t+1}) ≈ Φ'Ξg
Also in the case of LSPE
G_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) φ(i_t)' ≈ Φ'ΞΦ
Convergence based on law of large numbers.
C_k, d_k, and G_k can be formed incrementally. Also can be written using the formalism of temporal differences (this is just a matter of style)
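A sketch of forming C_k and d_k from a single simulated trajectory and returning the LSTD estimate C_k^{-1} d_k; traj (a list of visited states), features (returning φ(i)), and g (the transition cost) are hypothetical inputs.

```python
import numpy as np

def lstd_from_trajectory(traj, features, g, alpha, s):
    """LSTD(0) sketch: build C_k, d_k along one trajectory (i_0, i_1, ...) and solve."""
    C = np.zeros((s, s))
    d = np.zeros(s)
    k = 0
    for (i, j) in zip(traj[:-1], traj[1:]):            # transitions (i_t, i_{t+1})
        phi_i, phi_j = features(i), features(j)
        C += np.outer(phi_i, phi_i - alpha * phi_j)    # accumulates phi(i_t)(phi(i_t)-alpha phi(i_{t+1}))'
        d += phi_i * g(i, j)                           # accumulates phi(i_t) g(i_t, i_{t+1})
        k += 1
    C /= k
    d /= k
    return np.linalg.solve(C, d)                       # r_hat_k = C_k^{-1} d_k
```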
OPTIMISTIC VERSIONS
Instead of calculating nearly exact approximations C_k ≈ C and d_k ≈ d, we do a less accurate approximation, based on few simulation samples
Evaluate (coarsely) current policy μ, then do a policy improvement
This often leads to faster computation (as optimistic methods often do)
Very complex behavior (see the subsequent discussion on oscillations)
The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling)
LSPE tends to cope better because of its iterative nature
A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise
r_{k+1} = r_k - γ G_k (C_k r_k - d_k)
MULTISTEP METHODS
Introduce a multistep version of Bellman's equation J = T^(λ) J, where for λ ∈ [0, 1),
T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}
Geometrically weighted sum of powers of T.
Note that T^ℓ is a contraction with modulus α^ℓ, with respect to the weighted Euclidean norm ||·||_ξ, where ξ is the steady-state probability vector of the Markov chain.
Hence T^(λ) is a contraction with modulus
α_λ = (1 - λ) Σ_{ℓ=0}^{∞} α^{ℓ+1} λ^ℓ = α(1 - λ) / (1 - αλ)
Note that α_λ → 0 as λ → 1
T and T^(λ) have the same fixed point J_μ and
||J_μ - Φr*_λ||_ξ ≤ (1 / √(1 - α_λ^2)) ||J_μ - ΠJ_μ||_ξ
where Φr*_λ is the fixed point of ΠT^(λ).
The fixed point Φr*_λ depends on λ.
BIAS-VARIANCE TRADEOFF
[Figure: bias-variance tradeoff within the subspace S = {Φr | r ∈ ℝ^s}: the solution of the projected equation Φr = ΠT^(λ)(Φr) moves from ΠJ_μ (λ = 1, small bias) toward the λ = 0 solution (larger bias), while the simulation error grows as λ → 1]
Error bound: ||J_μ - Φr*_λ||_ξ ≤ (1 / √(1 - α_λ^2)) ||J_μ - ΠJ_μ||_ξ
As λ → 1, we have α_λ → 0, so the error bound (and the quality of approximation) improves as λ → 1. In fact
lim_{λ→1} Φr*_λ = ΠJ_μ
But the simulation noise in approximating
T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}
increases
Choice of λ is usually based on trial and error
MULTISTEP PROJECTED EQ. METHODS
The projected Bellman equation is
Φr = ΠT^(λ)(Φr)
In matrix form: C^(λ) r = d^(λ), where
C^(λ) = Φ'Ξ ( I - αP^(λ) ) Φ,   d^(λ) = Φ'Ξ g^(λ),
with
P^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^{ℓ+1},   g^(λ) = Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^ℓ g
The LSTD(λ) method is
r̂_k = ( C_k^(λ) )^{-1} d_k^(λ),
where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).
The LSPE(λ) method is
r_{k+1} = r_k - γ G_k ( C_k^(λ) r_k - d_k^(λ) )
where G_k is a simulation-based approx. to (Φ'ΞΦ)^{-1}
TD(λ): An important simpler/slower iteration [similar to LSPE(λ) with G_k = I - see the text].
MORE ON MULTISTEP METHODS
The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (single simulation trajectory i_0, i_1, . . ., more complex formulas)
C_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} ( φ(i_m) - α φ(i_{m+1}) )'
d_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} g(i_m, i_{m+1})
In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).
Many different versions (see the text).
Note the λ-tradeoffs:
As λ → 1, C_k^(λ) and d_k^(λ) contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
The error bound ||J_μ - Φr_λ||_ξ becomes smaller
As λ → 1, ΠT^(λ) becomes a contraction for arbitrary projection norm
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
Review of approximate PI
Review of approximate policy evaluation based on projected Bellman equations
Exploration enhancement in policy evaluation
Oscillations in approximate PI
Aggregation - An alternative to the projected equation/Galerkin approach
Examples of aggregation
Simulation-based aggregation
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, . . . , n and finite set of controls u ∈ U(i)
Transition probabilities: p_ij(u)
[Figure: two-state Markov chain with transition probabilities p_11(u), p_12(u), p_21(u), p_22(u)]
Cost of a policy π = {μ_0, μ_1, . . .} starting at state i:
J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }
with α ∈ [0, 1)
Shorthand notation for DP mappings
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J(j) ),   i = 1, . . . , n
(T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + α J(j) ),   i = 1, . . . , n
APPROXIMATE PI
[Figure: approximate PI loop: guess initial policy → evaluate approximate cost J̃_μ(r) = Φr using simulation → generate "improved" policy μ̄]
Evaluation of typical policy μ: Linear cost function approximation
J̃_μ(r) = Φr
where Φ is a full rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
Policy "improvement" to generate μ̄:
μ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α φ(j)'r )
EVALUATION BY PROJECTED EQUATIONS
We discussed approximate policy evaluation by solving the projected equation
Φr = ΠT_μ(Φr)
Π: projection with a weighted Euclidean norm
Implementation by simulation (single long trajectory using current policy - important to make ΠT_μ a contraction). LSTD, LSPE methods.
Multistep option: Solve Φr = ΠT_μ^(λ)(Φr) with
T_μ^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T_μ^{ℓ+1}
As λ → 1, ΠT_μ^(λ) becomes a contraction for any projection norm
Bias-variance tradeoff
[Figure: bias-variance tradeoff, as before: the solution of Φr = ΠT^(λ)(Φr) moves between the λ = 0 solution and ΠJ_μ (λ = 1), with simulation error growing as λ → 1]
POLICY ITERATION ISSUES: EXPLORATION
1st major issue: exploration. To evaluate μ, we need to generate cost samples using μ
This biases the simulation by underrepresenting states that are unlikely to occur under μ.
As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.
This seriously impacts the improved policy.
This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
Common remedy is the off-policy approach: Replace P of current policy with a "mixture"
P̄ = (I - B)P + BQ
where B is a diagonal matrix with diagonal components in [0, 1] and Q is another transition matrix.
LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated. Related methods and ideas: importance sampling, geometric and free-form sampling (see the text).
POLICY ITERATION ISSUES: OSCILLATIONS
2nd major issue: oscillation of policies
Analysis using the greedy partition: R_μ is the set of parameter vectors r for which μ is greedy with respect to J̃(·, r) = Φr
R_μ = { r | T_μ(Φr) = T(Φr) }
There is a finite number of possible vectors r_μ, one generated from another in a deterministic way
[Figure: greedy partition regions R_{μ^k}, R_{μ^{k+1}}, R_{μ^{k+2}}, R_{μ^{k+3}} with the parameter vectors r_{μ^k}, r_{μ^{k+1}}, r_{μ^{k+2}}, r_{μ^{k+3}} cycling among them]
The algorithm ends up repeating some cycle of policies μ^k, μ^{k+1}, . . . , μ^{k+m} with
r_{μ^k} ∈ R_{μ^{k+1}}, r_{μ^{k+1}} ∈ R_{μ^{k+2}}, . . . , r_{μ^{k+m}} ∈ R_{μ^k}
Many different cycles are possible
MORE ON OSCILLATIONS/CHATTERING
In the case of optimistic policy iteration a different picture holds
[Figure: greedy partition regions R_{μ^1}, R_{μ^2}, R_{μ^3} with intermediate parameter vectors r_{μ^1}, r_{μ^2}, r_{μ^3}]
Oscillations are less violent, but the "limit" point is meaningless!
Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J' does not imply ΠJ ≤ ΠJ'.
If approximate PI uses policy evaluation
Φr = (WT_μ)(Φr)
with W a monotone operator, the generated policies converge (to a possibly nonoptimal limit).
The operator W used in the aggregation approach has this monotonicity property.
PROBLEM APPROXIMATION - AGGREGATION
Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem.
The simplification is often ad-hoc/problem-dependent.
Aggregation is a systematic approach for problem approximation. Main elements:
Introduce a few "aggregate" states, viewed as the states of an "aggregate" system
Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.
AGGREGATION/DISAGGREGATION PROBS
(Figure: original system states i, j and aggregate states x, y, connected by the disaggregation probabilities d_xi (matrix D) and the aggregation probabilities φ_jy (matrix Φ).)
The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices.

For each original system state j and aggregate state y, the aggregation probability φ_jy:
- Roughly, the degree of membership of j in the aggregate state y.
- In hard aggregation, φ_jy = 1 if state j belongs to aggregate state/subset y.

For each aggregate state x and original system state i, the disaggregation probability d_xi:
- Roughly, the degree to which i is representative of x.
- In hard aggregation, equal d_xi for all i in x.
AGGREGATE SYSTEM DESCRIPTION
The transition probability from aggregate state x to aggregate state y under control u is

p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy,   or   P̂(u) = D P(u) Φ

where the rows of D and Φ are the disaggregation and aggregation probabilities.

The expected transition cost is

ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j),   or   ĝ = D P g
The optimal cost function of the aggregate problem, denoted R̂, is

R̂(x) = min_{u ∈ U} [ ĝ(x, u) + α Σ_y p̂_xy(u) R̂(y) ],   for all x

Bellman's equation for the aggregate problem.
The optimal cost function J* of the original problem is approximated by J̃ given by

J̃(j) = Σ_y φ_jy R̂(y),   for all j
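To make the construction concrete, here is a small sketch (my own; the array names and shapes are assumptions, not from the text) that forms P̂(u) = D P(u) Φ and ĝ(x, u), solves the aggregate Bellman equation by value iteration, and returns the approximation J̃ = Φ R̂:

    import numpy as np

    def solve_aggregate(P, g, D, Phi, alpha, iters=1000):
        # P[u]: n x n transition matrix, g[u]: n x n stage-cost matrix for control u,
        # D: m x n disaggregation probabilities, Phi: n x m aggregation probabilities.
        num_u, m = len(P), D.shape[0]
        P_hat = [D @ P[u] @ Phi for u in range(num_u)]                 # aggregate transition probs
        g_hat = [D @ (P[u] * g[u]).sum(axis=1) for u in range(num_u)]  # expected aggregate costs
        R = np.zeros(m)
        for _ in range(iters):       # value iteration on the aggregate problem
            R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in range(num_u)], axis=0)
        return Phi @ R, R            # J~(j) = sum_y phi_jy R^(y), and R^ itself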
EXAMPLE I: HARD AGGREGATION
Group the original system states into subsets, and view each subset as an aggregate state.

Aggregation probs.: φ_jy = 1 if j belongs to aggregate state y.

Disaggregation probs.: There are many possibilities, e.g., all states i within aggregate state x have equal prob. d_xi.

If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. Suggests grouping states with roughly equal cost into aggregates.

A variant: soft aggregation (provides soft boundaries between aggregate states).
(Figure: a 3 x 3 grid of states 1-9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, with aggregation matrix

Φ =
  [ 1 0 0 0
    1 0 0 0
    0 1 0 0
    1 0 0 0
    1 0 0 0
    0 1 0 0
    0 0 1 0
    0 0 1 0
    0 0 0 1 ] )
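For concreteness, a short sketch (mine, not from the slides) that builds this hard-aggregation Φ from the partition above (0-based state indices) together with equal disaggregation probabilities:

    import numpy as np

    groups = [[0, 1, 3, 4], [2, 5], [6, 7], [8]]       # x1, x2, x3, x4 for states 1-9 (0-based)
    n, m = 9, len(groups)
    Phi = np.zeros((n, m))
    for y, members in enumerate(groups):
        Phi[members, y] = 1.0                          # phi_jy = 1 iff j belongs to subset y
    D = Phi.T / Phi.sum(axis=0, keepdims=True).T       # equal d_xi within each subset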
EXAMPLE II: FEATURE-BASED AGGREGATION
Important question: How do we group states together?

If we know good features, it makes sense to group together states that have similar features.
(Figure: states, feature extraction, features, aggregate states.)
A general approach for passing from a feature-based state representation to an aggregation-based architecture.

Essentially discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.

An aggregation-based architecture is more powerful (nonlinear in the features) ...

... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.
EXAMPLE III: REP. STATES/COARSE GRID
Choose a collection of representative original system states, and associate each one of them with an aggregate state.
(Figure: original state space with representative/aggregate states y1, y2, y3 and original system states j, j1, j2, j3.)
Disaggregation probabilities are d_xi = 1 if i is equal to representative state x.

Aggregation probabilities associate original system states with convex combinations of representative states:

j ≈ Σ_{y ∈ A} φ_jy y

Well-suited for Euclidean space discretization.

Extends nicely to continuous state space, including the belief space of POMDP.
EXAMPLE IV: REPRESENTATIVE FEATURES
Here the aggregate states are nonempty subsets of original system states (but need not form a partition of the state space).

Example: Choose a collection of distinct representative feature vectors, and associate each of them with an aggregate state consisting of original system states with similar features.

Restrictions:
- The aggregate states/subsets are disjoint.
- The disaggregation probabilities satisfy d_xi > 0 if and only if i ∈ x.
- The aggregation probabilities satisfy φ_jy = 1 for all j ∈ y.
If every original system state i belongs to some aggregate state, we obtain hard aggregation.

If every aggregate state consists of a single original system state, we obtain aggregation with representative states.

With the above restrictions DΦ = I, so (ΦD)(ΦD) = ΦD, and ΦD is an oblique projection (an orthogonal projection in the case of hard aggregation).
APPROXIMATE PI BY AGGREGATION
(Figure: original system states i, j with transition probabilities p_ij(u) and costs g(i, u, j), linked to aggregate states x, y through the disaggregation probabilities d_xi and aggregation probabilities φ_jy; the aggregate quantities are p̂_xy(u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) φ_jy and ĝ(x, u) = Σ_{i=1}^n d_xi Σ_{j=1}^n p_ij(u) g(i, u, j).)
Consider approximate policy iteration for the original problem, with policy evaluation done by aggregation.

Evaluation of policy μ: J̃_μ = ΦR, where R = DT_μ(ΦR) (R is the vector of costs of aggregate states for μ). Can be done by simulation.

Looks like the projected equation ΦR = ΠT_μ(ΦR) (but with ΦD in place of Π).

Advantages: It has no problem with exploration or with oscillations.

Disadvantage: The rows of D and Φ must be probability distributions.
DISTRIBUTED AGGREGATION I
We consider decomposition/distributed solution of large-scale discounted DP problems by aggregation.

Partition the original system states into subsets S_1, ..., S_m.

Each subset S_ℓ, ℓ = 1, ..., m:
- Maintains detailed/exact local costs J(i) for every original system state i ∈ S_ℓ, using the aggregate costs of the other subsets
- Maintains an aggregate cost R(ℓ) = Σ_{i ∈ S_ℓ} d_{ℓi} J(i)
- Sends R(ℓ) to the other aggregate states

J(i) and R(ℓ) are updated by VI according to
J_{k+1}(i) = min_{u ∈ U(i)} H(i, u, J_k, R_k),   i ∈ S_ℓ

with R_k being the vector of R(ℓ) at time k, and

H(i, u, J, R) = Σ_{j=1}^n p_ij(u) g(i, u, j) + α Σ_{j ∈ S_ℓ} p_ij(u) J(j) + α Σ_{ℓ' ≠ ℓ} Σ_{j ∈ S_ℓ'} p_ij(u) R(ℓ')
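A sketch of the mapping H (my own naming; subsets[ell] is assumed to list the states of S_ell, and which[i] to give the index ell of the subset containing i):

    import numpy as np

    def H(i, u, J, R, P, g, alpha, subsets, which):
        # H(i,u,J,R) = sum_j p_ij(u) g(i,u,j)
        #   + alpha * sum_{j in S_ell}  p_ij(u) J(j)
        #   + alpha * sum_{ell' != ell} sum_{j in S_ell'} p_ij(u) R(ell')
        ell = which[i]
        val = P[u][i] @ g[u][i]                                    # expected stage cost
        val += alpha * P[u][i][subsets[ell]] @ J[subsets[ell]]     # detailed local costs in S_ell
        for other, S in enumerate(subsets):
            if other != ell:
                val += alpha * P[u][i][S].sum() * R[other]         # aggregate costs of other subsets
        return val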
DISTRIBUTED AGGREGATION II
Can show that this iteration involves a sup-norm contraction mapping of modulus α, so it converges to the unique solution of the system of equations in (J, R):

J(i) = min_{u ∈ U(i)} H(i, u, J, R),   R(ℓ) = Σ_{i ∈ S_ℓ} d_{ℓi} J(i),

for all i ∈ S_ℓ, ℓ = 1, ..., m.

This follows from the fact that {d_{ℓi} | i = 1, ..., n} is a probability distribution.

View these equations as a set of Bellman equations for an aggregate DP problem. The difference is that the mapping H involves J(j) rather than R(x(j)) for j ∈ S_ℓ.

In an asynchronous version of the method, the aggregate costs R(ℓ) may be outdated to account for communication delays between aggregate states.
Convergence can be shown using the general theory of asynchronous distributed computation (see the text).
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
Review of Q-factors and Bellman equations for Q-factors
VI and PI for Q-factors
Q-learning - Combination of VI and sampling
Q-learning and cost function approximation
Approximation in policy space
DISCOUNTED MDP
System: Controlled Markov chain with states i = 1, ..., n and a finite set of controls u ∈ U(i).

Transition probabilities: p_ij(u)
Cost of a policy π = {μ_0, μ_1, ...} starting at state i:

J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i = i_0 }

with α ∈ [0, 1).

Shorthand notation for the DP mappings:
(T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ(j) ),   i = 1, ..., n

(T_μ J)(i) = Σ_{j=1}^n p_ij(μ(i)) ( g(i, μ(i), j) + αJ(j) ),   i = 1, ..., n
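These mappings are easy to code; a possible NumPy sketch (mine, with P[u] and g[u] as n x n arrays and mu an array of control indices) is:

    import numpy as np

    def T(J, P, g, alpha):
        # (T J)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))
        return np.min([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(len(P))], axis=0)

    def T_mu(J, mu, P, g, alpha):
        # (T_mu J)(i) = sum_j p_ij(mu(i)) (g(i,mu(i),j) + alpha J(j))
        return np.array([P[mu[i]][i] @ (g[mu[i]][i] + alpha * J) for i in range(len(J))])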
THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any J ∈ R^n,

J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, ..., n
Policy iteration: Given μ^k:

- Policy evaluation: Find J_{μ^k} by solving

J_{μ^k}(i) = Σ_{j=1}^n p_ij(μ^k(i)) ( g(i, μ^k(i), j) + αJ_{μ^k}(j) ),   i = 1, ..., n

or J_{μ^k} = T_{μ^k} J_{μ^k}
- Policy improvement: Let μ^{k+1} be such that

μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ_{μ^k}(j) ),   for all i

or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
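A compact sketch of exact PI under these definitions (my own; it assumes a common control set and the same P, g arrays as above), with evaluation done by solving the linear system J_μ = T_μ J_μ:

    import numpy as np

    def policy_iteration(P, g, alpha, max_iters=100):
        num_u, n = len(P), P[0].shape[0]
        mu = np.zeros(n, dtype=int)                  # arbitrary initial policy
        for _ in range(max_iters):
            # Policy evaluation: solve (I - alpha P_mu) J = g_mu
            P_mu = np.array([P[mu[i]][i] for i in range(n)])
            g_mu = np.array([P[mu[i]][i] @ g[mu[i]][i] for i in range(n)])
            J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
            # Policy improvement: T_{mu^{k+1}} J = T J
            Q = np.array([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(num_u)])
            new_mu = Q.argmin(axis=0)
            if np.array_equal(new_mu, mu):
                break
            mu = new_mu
        return mu, J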
We discussed approximate versions of VI and PI using projection and aggregation.

We focused so far on cost functions and their approximation. We now consider Q-factors.
BELLMAN EQUATIONS FOR Q-FACTORS
The optimal Q-factors are defined by

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) ),   for all (i, u)

Since J* = T J*, we have J*(i) = min_{u ∈ U(i)} Q*(i, u), so the optimal Q-factors solve the equation

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = F Q*, where

(F Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )
This is Bellman's equation for a system whose states are the pairs (i, u).

Similar mapping F_μ and Bellman equation for a policy μ: Q_μ = F_μ Q_μ
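The mapping F is straightforward to write down; a small sketch (mine, with Q stored as an n x m array and a common control set) is:

    import numpy as np

    def F(Q, P, g, alpha):
        # (F Q)(i,u) = sum_j p_ij(u) (g(i,u,j) + alpha min_{u'} Q(j,u'))
        J = Q.min(axis=1)                    # J(j) = min over u' of Q(j, u')
        cols = [(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(Q.shape[1])]
        return np.stack(cols, axis=1)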
(Figure: state-control pairs (i, u); under control u the system moves to j with probability p_ij(u) and cost g(i, u, j), and then to the pair (j, μ(j)) under a fixed policy μ.)
SUMMARY OF BELLMAN EQS FOR Q-FACTORS
Optimal Q-factors: For all (i, u),

Q*(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = F Q*, where

(F Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )
Q-factors of a policy μ: For all (i, u),

Q_μ(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αQ_μ(j, μ(j)) )

Equivalently Q_μ = F_μ Q_μ, where

(F_μ Q)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αQ(j, μ(j)) )
WHAT IS GOOD AND BAD ABOUT Q-FACTORS
All the exact theory and algorithms for costs apply to Q-factors:
- Bellman's equations, contractions, optimality conditions, convergence of VI and PI

All the approximate theory and algorithms for costs apply to Q-factors:
- Projected equations, sampling and exploration issues, oscillations, aggregation
A MODEL-FREE (on-line) controller implementation:
- Once we calculate Q*(i, u) for all (i, u),

μ*(i) = arg min_{u ∈ U(i)} Q*(i, u),   for all i

- Similarly, once we calculate a parametric approximation Q̃(i, u, r) for all (i, u),

μ̃(i) = arg min_{u ∈ U(i)} Q̃(i, u, r),   for all i

The main bad thing: greater dimension and more storage! [Can be used for large-scale problems only through aggregation, or other cost function approximation.]
Q-LEARNING
In addition to the approximate PI methods adapted for Q-factors, there is an important additional algorithm: Q-learning, which can be viewed as a sampled form of VI.
Q-learning algorithm (in its classical form):
- Sampling: Select a sequence of pairs (i_k, u_k) (use any probabilistic mechanism for this, but all pairs (i, u) are chosen infinitely often).
- Iteration: For each k, select j_k according to p_{i_k j}(u_k). Update just Q(i_k, u_k):

Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )

- Leave unchanged all other Q-factors: Q_{k+1}(i, u) = Q_k(i, u) for all (i, u) ≠ (i_k, u_k).
- Stepsize conditions: γ_k must converge to 0 at a proper rate (e.g., like 1/k).
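A minimal sketch of the classical Q-learning iteration (mine; sample_next(i, u) is a hypothetical simulator returning the next state and stage cost, and the stepsize is 1/(number of visits of the pair)):

    import numpy as np

    def q_learning(sample_next, n, m, alpha, num_iters=100000, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n, m))
        visits = np.zeros((n, m))
        for _ in range(num_iters):
            i, u = rng.integers(n), rng.integers(m)     # sample pairs; each (i,u) chosen infinitely often
            j, cost = sample_next(i, u)                 # simulate j_k according to p_{i_k j}(u_k)
            visits[i, u] += 1
            gamma = 1.0 / visits[i, u]                  # stepsize converging to 0 like 1/k
            Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[j].min())
        return Q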
NOTES AND QUESTIONS ABOUT Q-LEARNING
Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )
Model-free implementation. We just need a simulator that, given (i, u), produces the next state j and cost g(i, u, j).

Operates on only one state-control pair at a time. Convenient for simulation, no restrictions on the sampling method.

Aims to find the (exactly) optimal Q-factors.

Why does it converge to Q*?

Why can't I use a similar algorithm for optimal costs?
Important mathematical (fine) point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the cost version of Bellman's equation:

J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αJ*(j) )
CONVERGENCE ASPECTS OF Q-LEARNING
Q-learning can be shown to converge to the true/exact Q-factors (under mild assumptions).

The proof is sophisticated, based on theories of stochastic approximation and asynchronous algorithms.
Uses the fact that the Q-learning map F:

(F Q)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }

is a sup-norm contraction.
Generic stochastic approximation algorithm:
- Consider a generic fixed point problem involving an expectation:

x = E_w { f(x, w) }

- Assume E_w { f(x, w) } is a contraction with respect to some norm, so the iteration

x_{k+1} = E_w { f(x_k, w) }

converges to the unique fixed point.
- Approximate E_w { f(x, w) } by sampling.
STOCH. APPROX. CONVERGENCE IDEAS
For each k, obtain samples {w_1, ..., w_k} and use the approximation

x_{k+1} = (1/k) Σ_{t=1}^k f(x_k, w_t) ≈ E { f(x_k, w) }

This iteration approximates the convergent fixed point iteration x_{k+1} = E_w { f(x_k, w) }.

A major flaw: it requires, for each k, the computation of f(x_k, w_t) for all values w_t, t = 1, ..., k.

This motivates the more convenient iteration

x_{k+1} = (1/k) Σ_{t=1}^k f(x_t, w_t),   k = 1, 2, ...,

that is similar, but requires much less computation; it needs only one value of f per sample w_t.

By denoting γ_k = 1/k, it can also be written as

x_{k+1} = (1 - γ_k) x_k + γ_k f(x_k, w_k),   k = 1, 2, ...

Compare with Q-learning, where the fixed point problem is Q = F Q:
(F Q)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }
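A tiny numerical illustration of the iteration (mine, with a made-up contraction f(x, w) = 0.5 x + w and E{w} = 1, so the fixed point of x = E_w{ f(x, w) } is x* = 2):

    import numpy as np

    rng = np.random.default_rng(0)
    x = 0.0
    for k in range(1, 100001):
        w = rng.exponential(1.0)                        # noise sample with E{w} = 1
        gamma = 1.0 / k                                 # stepsize gamma_k = 1/k
        x = (1 - gamma) * x + gamma * (0.5 * x + w)     # x_{k+1} = (1 - gamma_k) x_k + gamma_k f(x_k, w_k)
    print(x)                                            # approaches the fixed point x* = 2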
Q-FACTOR APPROXIMATIONS
We introduce basis function approximation:

Q̃(i, u, r) = φ(i, u)' r

We can use approximate policy iteration and LSPE/LSTD for policy evaluation.

Optimistic policy iteration methods are frequently used on a heuristic basis.

Example: Generate a trajectory {(i_k, u_k) | k = 0, 1, ...}.

At iteration k, given r_k and state/control (i_k, u_k):
(1) Simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).

(2) Generate the control u_{k+1} from

u_{k+1} = arg min_{u ∈ U(i_{k+1})} Q̃(i_{k+1}, u, r_k)

(3) Update the parameter vector via

r_{k+1} = r_k - (LSPE or TD-like correction)

Complex behavior, unclear validity (oscillations, etc). There is a solid basis for an important special case: optimal stopping (see the text).
APPROXIMATION IN POLICY SPACE
We parameterize policies by a vector r = (r_1, ..., r_s) (an approximation architecture for policies).

Each policy μ̃(r) = { μ̃(i; r) | i = 1, ..., n } defines a cost vector J_{μ̃(r)} (a function of r).

We optimize some measure of J_{μ̃(r)} over r. For example, use a random search, gradient, or other method to minimize over r

Σ_{i=1}^n p_i J_{μ̃(r)}(i),

where (p_1, ..., p_n) is some probability distribution over the states.
An important special case: Introduce a cost approximation architecture V(i, r) that defines indirectly the parameterization of the policies

μ̃(i; r) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + αV(j, r) ),   for all i

Brings in features to approximation in policy space.
APPROXIMATION IN POLICY SPACE METHODS
Random search methods are straightforward and have scored some impressive successes with challenging problems (e.g., tetris).

Gradient-type methods (known as policy gradient methods) also have been worked on extensively.

They move along the gradient with respect to r of

Σ_{i=1}^n p_i J_{μ̃(r)}(i)

There are explicit gradient formulas which have been approximated by simulation.

Policy gradient methods generally suffer from slow convergence, local minima, and excessive simulation noise.
FINAL WORDS AND COMPARISONS
There is no clear winner among ADP methods.

There is interesting theory in all types of methods (which, however, does not provide ironclad performance guarantees).

There are major flaws in all methods:
- Oscillations and exploration issues in approximate PI with projected equations
- Restrictions on the approximation architecture in approximate PI with aggregation
- Flakiness of optimization in policy space approximation

Yet these methods have impressive successes to show with enormously complex problems, for which there is no alternative methodology.

There are also other competing ADP methods (rollout is simple, often successful, and generally reliable; approximate LP is worth considering).

Theoretical understanding is important and nontrivial.

Practice is an art and a challenge to our creativity.