MIT Dynamic Programming Lecture Slides


TRANSCRIPT

  • Slide 1/261

    LECTURE SLIDES ON DYNAMIC PROGRAMMING

    BASED ON LECTURES GIVEN AT THE

    MASSACHUSETTS INSTITUTE OF TECHNOLOGY

    CAMBRIDGE, MASS

    FALL 2004

    DIMITRI P. BERTSEKAS

These lecture slides are based on the book: Dynamic Programming and Optimal Control, 2nd edition, Vols. I and II, Athena Scientific, 2001, by Dimitri P. Bertsekas; see

    http://www.athenasc.com/dpbook.html

    Last Updated: December 2004

The slides are copyrighted, but may be freely reproduced and distributed for any noncommercial purpose.

  • Slide 2/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 1

    LECTURE OUTLINE

    Problem Formulation

Examples

The Basic Problem

Significance of Feedback

  • Slide 3/261

  • Slide 4/261

    BASIC STRUCTURE OF STOCHASTIC DP

Discrete-time system: x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N-1

k: discrete time

x_k: state; summarizes past information that is relevant for future optimization

u_k: control; decision to be selected at time k from a given set

w_k: random parameter (also called disturbance or noise, depending on the context)

N: horizon, or number of times control is applied

Cost function that is additive over time:

E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
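The backward recursion implied by this additive structure is short to state in code. The following is a minimal sketch (my illustration, not from the slides), assuming finite state, control, and disturbance spaces and time-invariant f and g; w_dist is a list of (w, probability) pairs:

```python
def solve_dp(states, controls, f, g, g_N, w_dist, N):
    """Backward DP: J_N(x) = g_N(x),
    J_k(x) = min_u E_w[ g(x,u,w) + J_{k+1}(f(x,u,w)) ]."""
    J = [dict() for _ in range(N + 1)]   # J[k][x]: cost-to-go
    mu = [dict() for _ in range(N)]      # mu[k][x]: minimizing control
    for x in states:
        J[N][x] = g_N(x)                 # terminal cost
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best = None, float("inf")
            for u in controls(x):
                # expected stage cost plus cost-to-go over the disturbance
                c = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                        for w, p in w_dist)
                if c < best:
                    best_u, best = u, c
            J[k][x], mu[k][x] = best, best_u
    return J, mu
```

The returned mu is a policy in the sense of the slides: a rule u_k = mu_k(x_k) mapping states to controls.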

  • Slide 5/261

    INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram. Stock x_k at period k, stock u_k ordered at period k, demand w_k; stock at period k+1: x_{k+1} = x_k + u_k - w_k; cost of period k: c u_k + r(x_k + u_k - w_k).]

Discrete-time system:

x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k

Cost function that is additive over time:

E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E{ Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }

Optimization over policies: rules/functions u_k = μ_k(x_k) that map states to controls

  • Slide 6/261

    ADDITIONAL ASSUMPTIONS

The set of values that the control u_k can take depends at most on x_k and not on prior x or u

The probability distribution of w_k does not depend on past values w_{k-1}, ..., w_0, but may depend on x_k and u_k

Otherwise past values of w or x would be useful for future optimization

Sequence of events envisioned in period k:

x_k occurs according to x_k = f_{k-1}(x_{k-1}, u_{k-1}, w_{k-1})

u_k is selected with knowledge of x_k, i.e., u_k ∈ U(x_k)

w_k is random and generated according to a distribution P_{w_k}(· | x_k, u_k)

  • Slide 7/261

    DETERMINISTIC FINITE-STATE PROBLEMS

Scheduling example: Find the optimal sequence of operations A, B, C, D

A must precede B, and C must precede D

Given startup costs S_A and S_C, and setup transition cost C_mn from operation m to operation n

[Figure: state transition graph from the initial state. Nodes are the partial operation sequences (A, C, AB, AC, CA, CD, and so on, through the complete schedules); arcs carry the startup costs S_A, S_C and the transition costs C_mn between consecutive operations.]

  • Slide 8/261

    STOCHASTIC FINITE-STATE PROBLEMS

Example: Find the optimal two-game chess match strategy

Timid play draws with prob. p_d > 0 and loses with prob. 1 - p_d. Bold play wins with prob. p_w < 1/2 and loses with prob. 1 - p_w

  • Slide 32/261

    EXAMPLE

[Figure: shortest path reformulation of the scheduling example. Origin node s = A, artificial terminal node t; the partial schedules AB, AC, AD, ABC, ABD, ACB, ACD, ADB, ADC and the complete schedules ABCD, ABDC, ACBD, ACDB, ADBC, ADCB form the intermediate nodes (numbered 1 through 10 in the iteration table below), with arc lengths such as 1, 3, 4, 5, 15, and 20 taken from the setup and transition costs.]

Iter. No. | Node Exiting OPEN | OPEN after Iteration | UPPER

0  | -  | 1           | ∞
1  | 1  | 2, 7, 10    | ∞
2  | 2  | 3, 5, 7, 10 | ∞
3  | 3  | 4, 5, 7, 10 | ∞
4  | 4  | 5, 7, 10    | 43
5  | 5  | 6, 7, 10    | 43
6  | 6  | 7, 10       | 13
7  | 7  | 8, 10       | 13
8  | 8  | 9, 10       | 13
9  | 9  | 10          | 13
10 | 10 | Empty       | 13

Note that some nodes never entered OPEN

  • Slide 33/261

    LABEL CORRECTING METHODS

Origin s, destination t, lengths a_ij that are ≥ 0

d_i (label of i): length of the shortest path from s to i found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s → i path

UPPER: label d_t of the destination

OPEN list: contains active nodes (initially OPEN = {s})

Flowchart: REMOVE a node i from OPEN; for each child j of i:

Is d_i + a_ij < d_j? (Is the path s → i → j better than the current path s → j?)

Is d_i + a_ij < UPPER? (Does the path s → i → j have a chance to be part of a shorter s → t path?)

If YES to both, set d_j = d_i + a_ij and INSERT j into OPEN
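A minimal sketch of the method in code (my illustration, assuming the graph is given as succ[i], a list of (j, a_ij) pairs with a_ij ≥ 0); it applies exactly the two tests of the flowchart and never places the destination t in OPEN:

```python
import math

def label_correcting(succ, s, t):
    """Shortest s -> t distance by the generic label correcting method."""
    d = {s: 0.0}                  # labels; a missing key means infinity
    UPPER = math.inf              # label of the destination
    OPEN = [s]
    while OPEN:
        i = OPEN.pop()            # REMOVE (LIFO gives a depth-first flavor)
        for j, a_ij in succ.get(i, []):
            # is s -> i -> j better than the current s -> j path,
            # and can it be part of a shorter s -> t path?
            if d[i] + a_ij < d.get(j, math.inf) and d[i] + a_ij < UPPER:
                d[j] = d[i] + a_ij        # set d_j = d_i + a_ij
                if j == t:
                    UPPER = d[j]
                else:
                    OPEN.append(j)        # INSERT j into OPEN
    return UPPER
```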

  • Slide 34/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 4

    LECTURE OUTLINE

    Label correcting methods for shortest paths

    Variants of label correcting methods

    Branch-and-bound as a shortest path algorithm

  • Slide 35/261

    LABEL CORRECTING METHODS

Origin s, destination t, lengths a_ij that are ≥ 0

d_i (label of i): length of the shortest path from s to i found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s → i path

UPPER: label d_t of the destination

OPEN list: contains active nodes (initially OPEN = {s})

Flowchart: REMOVE a node i from OPEN; for each child j of i:

Is d_i + a_ij < d_j? (Is the path s → i → j better than the current path s → j?)

Is d_i + a_ij < UPPER? (Does the path s → i → j have a chance to be part of a shorter s → t path?)

If YES to both, set d_j = d_i + a_ij and INSERT j into OPEN

  • Slide 36/261

    VALIDITY OF LABEL CORRECTING METHODS

Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination.

Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j

(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates

(3) Let (s, j_1, j_2, ..., j_k, t) be a shortest path and let d* be the shortest distance. If UPPER > d* at termination, UPPER will also be larger than the length of all the paths (s, j_1, ..., j_m), m = 1, ..., k, throughout the algorithm. Hence, node j_k will never enter the OPEN list with d_{j_k} equal to the shortest distance from s to j_k. Similarly, node j_{k-1} will never enter the OPEN list with d_{j_{k-1}} equal to the shortest distance from s to j_{k-1}. Continue to j_1 to get a contradiction.

  • Slide 37/261

    MAKING THE METHOD EFFICIENT

Reduce the value of UPPER as quickly as possible

Try to discover good s → t paths early in the course of the algorithm

Keep the number of reentries into OPEN low

Try to remove from OPEN nodes with small label first

Heuristic rationale: if d_i is small, then d_j, when set to d_i + a_ij, will be accordingly small, so reentrance of j in the OPEN list is less likely

Reduce the overhead for selecting the node to be removed from OPEN

These objectives are often in conflict. They give rise to a large variety of distinct implementations. Good practical strategies try to strike a compromise between low overhead and small-label node selection.

  • Slide 38/261

    NODE SELECTION METHODS

Depth-first search: Remove from the top of OPEN and insert at the top of OPEN.

Has low memory storage requirements (OPEN is not too long). Reduces UPPER quickly.

[Figure: depth-first search order on a tree with origin node s and destination node t; nodes are numbered in the order they exit OPEN.]

Best-first search (Dijkstra): Remove from OPEN a node with minimum value of label.

Interesting property: Each node will be inserted in OPEN at most once.

    Many implementations/approximations
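For contrast with depth-first removal, a best-first sketch that keeps OPEN in a binary heap might look as follows (my illustration, with the same succ representation as before; stale heap entries are skipped rather than deleted):

```python
import heapq, math

def best_first(succ, s, t):
    """Dijkstra-like node selection: remove a minimum-label node from OPEN."""
    d = {s: 0.0}
    OPEN = [(0.0, s)]
    while OPEN:
        di, i = heapq.heappop(OPEN)
        if di > d.get(i, math.inf):
            continue                      # stale entry: i was re-labeled
        if i == t:
            return di                     # first removal of t is optimal
        for j, a_ij in succ.get(i, []):
            if di + a_ij < d.get(j, math.inf):
                d[j] = di + a_ij
                heapq.heappush(OPEN, (d[j], j))
    return math.inf
```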

  • Slide 39/261

    ADVANCED INITIALIZATION

Instead of starting from d_i = ∞ for all i ≠ s, start with

d_i = length of some path from s to i (or d_i = ∞)

OPEN = { i ≠ t | d_i < ∞ }

Motivation: Get a small starting value of UPPER.

No node with shortest distance ≥ the initial value of UPPER will enter OPEN

Good practical idea:

Run a heuristic (or use common sense) to get a good starting path P from s to t

Use as UPPER the length of P, and as d_i the path distances of all nodes i along P

Very useful also in reoptimization, where we solve the same problem with slightly different data

  • Slide 40/261

    VARIANTS OF LABEL CORRECTING METHODS

If a lower bound h_j of the true shortest distance from j to t is known, use the test

d_i + a_ij + h_j < UPPER

for admission into OPEN (in place of d_i + a_ij < UPPER)

  • Slide 41/261

    BRANCH-AND-BOUND METHOD

Problem: Minimize f(x) over a finite set of feasible solutions X.

Idea of branch-and-bound: Partition the feasible set into smaller subsets, and then calculate certain bounds on the attainable cost within some of the subsets to eliminate from further consideration other subsets.

Bounding Principle

Given two subsets Y_1 ⊂ X and Y_2 ⊂ X, suppose that we have bounds

f_1 ≤ min_{x∈Y_1} f(x) (a lower bound for Y_1), f_2 ≥ min_{x∈Y_2} f(x) (an upper bound for Y_2).

Then, if f_2 ≤ f_1, the solutions in Y_1 may be disregarded since their cost cannot be smaller than the cost of the best solution in Y_2.

The B&B algorithm can be viewed as a label correcting algorithm, where lower bounds define the arc costs, and upper bounds are used to strengthen the test for admission to OPEN.

  • Slide 42/261

    SHORTEST PATH IMPLEMENTATION

Acyclic graph/partition of X into subsets (typically a tree). The leaves consist of single solutions.

Upper and lower bounds, f̄_Y and f_Y, for the minimum cost over each subset Y can be calculated (f denoting the lower bound below).

The lower bound of a leaf {x} is f(x)

Each arc (Y, Z) has length f_Z - f_Y

Shortest distance from X to Y = f_Y - f_X

Distance from origin X to a leaf {x} is f(x) - f_X

Shortest path from X to the set of leaves gives the optimal cost and optimal solution

UPPER is the smallest f(x) out of leaf nodes {x} examined so far

[Figure: tree partition of {1,2,3,4,5} into {1,2,3} and {4,5}, with {1,2,3} split into {1,2} and {3}, down to the leaves {1}, {2}, {3}, {4}, {5}.]

  • Slide 43/261

    BRANCH-AND-BOUND ALGORITHM

Step 1: Remove a node Y from OPEN. For each child Y_j of Y, do the following: If f_{Y_j} < UPPER (f the lower bound), place Y_j in OPEN. If in addition f̄_{Y_j} < UPPER, set UPPER = f̄_{Y_j}, and if Y_j consists of a single solution, mark that solution as the best solution found so far.

Step 2: (Termination test) If OPEN is nonempty, go to Step 1. Otherwise, terminate: the best solution found so far is optimal.
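A hedged sketch of these steps (my illustration; root, children, and the bound functions are assumed supplied by the user, with f_lower(Y) ≤ min over Y ≤ f_upper(Y) and both bounds equal to f(x) at a singleton leaf):

```python
import math

def branch_and_bound(root, children, f_lower, f_upper):
    """Sketch of the slides' algorithm; subsets are tuples of solutions."""
    UPPER, best = math.inf, None
    OPEN = [root]
    while OPEN:                           # Step 2: stop when OPEN is empty
        Y = OPEN.pop()                    # Step 1: remove a node Y
        for Yj in children(Y):
            if f_lower(Yj) < UPPER:       # Yj may contain a better solution
                if len(Yj) == 1:
                    UPPER, best = f_lower(Yj), Yj[0]   # leaf: exact cost
                else:
                    OPEN.append(Yj)
            if f_upper(Yj) < UPPER:       # tighten the incumbent bound
                UPPER = f_upper(Yj)
    return UPPER, best
```

On termination UPPER equals the optimal cost; best records the last improving leaf examined.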

  • Slide 44/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 5

    LECTURE OUTLINE

    Examples of stochastic DP problems

Linear-quadratic problems

Inventory control

  • Slide 45/261

    LINEAR-QUADRATIC PROBLEMS

System: x_{k+1} = A_k x_k + B_k u_k + w_k

Quadratic cost:

E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + Σ_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }

where Q_k ≥ 0 and R_k > 0 (in the positive (semi)definite sense).

w_k are independent and zero mean

DP algorithm:

J_N(x_N) = x_N' Q_N x_N,

J_k(x_k) = min_{u_k} E{ x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }

Key facts:

J_k(x_k) is quadratic

Optimal policy {μ*_0, ..., μ*_{N-1}} is linear: μ*_k(x_k) = L_k x_k

Similar treatment of a number of variants

  • Slide 46/261

    DERIVATION

By induction verify that

μ*_k(x_k) = L_k x_k, J_k(x_k) = x_k' K_k x_k + constant,

where the L_k are matrices given by

L_k = -(B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} A_k,

and where the K_k are symmetric positive semidefinite matrices given by

K_N = Q_N,

K_k = A_k' ( K_{k+1} - K_{k+1} B_k (B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} ) A_k + Q_k.

This is called the discrete-time Riccati equation. Just like DP, it starts at the terminal time N and proceeds backwards.

Certainty equivalence holds (the optimal policy is the same as when w_k is replaced by its expected value E{w_k} = 0).
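A short numerical sketch of the recursion (my illustration, assuming time-invariant A, B, Q, R given as numpy arrays):

```python
import numpy as np

def riccati_backward(A, B, Q, R, QN, N):
    """Discrete-time Riccati recursion, run backward from K_N = Q_N.
    Returns K_0 and the gains L_0, ..., L_{N-1}."""
    K = QN
    gains = []
    for _ in range(N):
        M = B.T @ K @ B + R
        L = -np.linalg.solve(M, B.T @ K @ A)                 # L_k
        K = A.T @ (K - K @ B @ np.linalg.solve(M, B.T @ K)) @ A + Q
        gains.append(L)
    gains.reverse()                                          # gains[k] = L_k
    return K, gains
```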

  • Slide 47/261

    ASYMPTOTIC BEHAVIOR OF RICCATI EQUATION

Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C'C

The Riccati equation converges: lim_{k→∞} K_k = K, where K is positive definite, and is the unique (within the class of positive semidefinite matrices) solution of the algebraic Riccati equation

K = A' ( K - KB(B'KB + R)^{-1} B'K ) A + Q

The corresponding steady-state controller μ*(x) = Lx, where

L = -(B'KB + R)^{-1} B'KA,

is stable in the sense that the matrix (A + BL) of the closed-loop system

x_{k+1} = (A + BL) x_k + w_k

satisfies lim_{k→∞} (A + BL)^k = 0.

  • Slide 48/261

    GRAPHICAL PROOF FOR SCALAR SYSTEMS

[Figure: graph of F(P) against the 45° line, with intercept Q at P = 0, horizontal asymptote A²R/B² + Q, pole at P = -R/B², and iterates P_k, P_{k+1} converging to the positive fixed point P*.]

Riccati equation (with P_k = K_{N-k}):

P_{k+1} = A² ( P_k - B² P_k² / (B² P_k + R) ) + Q,

or P_{k+1} = F(P_k), where

F(P) = A² R P / (B² P + R) + Q.

Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.

  • Slide 49/261

    RANDOM SYSTEM MATRICES

Suppose that {A_0, B_0}, ..., {A_{N-1}, B_{N-1}} are not known but rather are independent random matrices that are also independent of the w_k

The DP algorithm is

J_N(x_N) = x_N' Q_N x_N,

J_k(x_k) = min_{u_k} E_{w_k, A_k, B_k} { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }

Optimal policy: μ*_k(x_k) = L_k x_k, where

L_k = -( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k},

and where the matrices K_k are given by

K_N = Q_N,

K_k = E{A_k' K_{k+1} A_k} - E{A_k' K_{k+1} B_k} ( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k} + Q_k

  • Slide 50/261

    PROPERTIES

Certainty equivalence may not hold

The Riccati equation may not converge to a steady state

[Figure: graph of F(P) against the 45° line, with intercept Q at P = 0 and vertical asymptote at P = -R/E{B²}.]

We have P_{k+1} = F(P_k), where

F(P) = E{A²} R P / ( E{B²} P + R ) + Q + T P² / ( E{B²} P + R ),

T = E{A²} E{B²} - (E{A})² (E{B})²

  • Slide 51/261

    INVENTORY CONTROL

x_k: stock, u_k: inventory purchased, w_k: demand

x_{k+1} = x_k + u_k - w_k, k = 0, 1, ..., N-1

Minimize

E{ Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }

where, for some p > 0 and h > 0,

r(x) = p max(0, -x) + h max(0, x)

DP algorithm:

J_N(x_N) = 0,

J_k(x_k) = min_{u_k≥0} [ c u_k + H(x_k + u_k) + E{ J_{k+1}(x_k + u_k - w_k) } ],

where H(x + u) = E{ r(x + u - w) }.

  • Slide 52/261

    OPTIMAL POLICY

The DP algorithm can be written as

J_N(x_N) = 0,

J_k(x_k) = min_{u_k≥0} G_k(x_k + u_k) - c x_k,

where

G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) }.

If G_k is convex and lim_{|x|→∞} G_k(x) → ∞, we have

μ*_k(x_k) = S_k - x_k if x_k < S_k, and 0 if x_k ≥ S_k,

where S_k minimizes G_k(y).

This is shown, assuming that c < p, by showing that J_k is convex for all k, and

lim_{|x|→∞} J_k(x) → ∞
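For intuition, a discretized sketch that computes the base-stock levels S_k on a finite grid (my illustration; the grid is assumed rich enough to contain y - w for the demands used, and off-grid values are given zero cost-to-go):

```python
def base_stock_levels(N, c, p, h, demand, probs, y_grid):
    """Backward recursion computing the minimizers S_k of
    G_k(y) = c*y + H(y) + E J_{k+1}(y - w), with
    r(x) = p*max(0,-x) + h*max(0,x) and H(y) = E r(y - w)."""
    r = lambda x: p * max(0.0, -x) + h * max(0.0, x)
    J = {y: 0.0 for y in y_grid}                     # J_N = 0
    S = [None] * N
    for k in range(N - 1, -1, -1):
        G = {y: c * y + sum(q * (r(y - w) + J.get(y - w, 0.0))
                            for w, q in zip(demand, probs))
             for y in y_grid}
        S[k] = min(G, key=G.get)                     # base-stock level S_k
        # J_k(x) = min_{y >= x} G_k(y) - c*x
        J = {x: min(G[y] for y in y_grid if y >= x) - c * x for x in y_grid}
    return S
```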

  • Slide 53/261

    JUSTIFICATION

Graphical inductive proof that J_k is convex.

[Figure: plots of -cy, H(y), and cy + H(y) against y, with minimizer S_{N-1} and value cS_{N-1}; the resulting cost-to-go J_{N-1}(x_{N-1}) is convex in x_{N-1}.]

  • Slide 54/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 6

    LECTURE OUTLINE

    Stopping problems

Scheduling problems

Other applications

  • Slide 55/261

    PURE STOPPING PROBLEMS

Two possible controls:

Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)

Continue [using x_{k+1} = f_k(x_k, w_k) and incurring the cost per stage]

Each policy consists of a partition of the set of states x_k into two regions:

Stop region, where we stop

Continue region, where we continue

[Figure: state space split into a STOP region and a CONTINUE region, with a transition into the absorbing stop state.]

  • Slide 56/261

    EXAMPLE: ASSET SELLING

A person has an asset, and at k = 0, 1, ..., N-1 receives a random offer w_k

May accept w_k and invest the money at a fixed rate of interest r, or reject w_k and wait for w_{k+1}. Must accept the last offer w_{N-1}

DP algorithm (x_k: current offer, T: stop state):

J_N(x_N) = x_N if x_N ≠ T, and 0 if x_N = T,

J_k(x_k) = max[ (1 + r)^{N-k} x_k, E{ J_{k+1}(w_k) } ] if x_k ≠ T, and 0 if x_k = T

Optimal policy:

accept the offer x_k if x_k > α_k,

reject the offer x_k if x_k < α_k,

where

α_k = E{ J_{k+1}(w_k) } / (1 + r)^{N-k}.
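The thresholds can be computed backward directly from the offer distribution. A small sketch (my illustration, assuming i.i.d. discrete offers given as values with probabilities):

```python
def asset_selling_thresholds(offers, probs, r, N):
    """Compute alpha_k = E{J_{k+1}(w)} / (1+r)^(N-k), k = 0..N-1."""
    EJ = sum(q * w for w, q in zip(offers, probs))   # E{J_N(w)} = E{w}
    alphas = [None] * N
    for k in range(N - 1, -1, -1):
        alphas[k] = EJ / (1 + r) ** (N - k)
        # E{J_k(w)} = E{ max( (1+r)^(N-k) * w, E{J_{k+1}(w)} ) }
        EJ = sum(q * max((1 + r) ** (N - k) * w, EJ)
                 for w, q in zip(offers, probs))
    return alphas
```

By the monotonicity argument on the next slide, the returned thresholds are nonincreasing in k.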

  • Slide 57/261

    FURTHER ANALYSIS

[Figure: thresholds α_1, α_2, ..., α_{N-1} plotted over k = 0, 1, 2, ..., N-1, N; ACCEPT region above the threshold curve, REJECT region below.]

Can show that α_k ≥ α_{k+1} for all k

Proof: Let V_k(x_k) = J_k(x_k)/(1 + r)^{N-k} for x_k ≠ T. Then the DP algorithm is V_N(x_N) = x_N and

V_k(x_k) = max[ x_k, (1 + r)^{-1} E_w{ V_{k+1}(w) } ].

We have α_k = E_w{ V_{k+1}(w) }/(1 + r), so it is enough to show that V_k(x) ≥ V_{k+1}(x) for all x and k. Start with V_{N-1}(x) ≥ V_N(x) and use the monotonicity property of DP.

We can also show that α_k → ā as k → -∞. This suggests that for an infinite horizon the optimal policy is stationary.

  • Slide 58/261

    GENERAL STOPPING PROBLEMS

At time k, we may stop at cost t(x_k) or choose a control u_k ∈ U(x_k) and continue:

J_N(x_N) = t(x_N),

J_k(x_k) = min[ t(x_k), min_{u_k∈U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]

Optimal to stop at time k for states x in the set

T_k = { x | t(x) ≤ min_{u∈U(x)} E{ g(x, u, w) + J_{k+1}(f(x, u, w)) } }

Since J_{N-1}(x) ≤ J_N(x), we have J_k(x) ≤ J_{k+1}(x) for all k, so

T_0 ⊂ · · · ⊂ T_k ⊂ T_{k+1} ⊂ · · · ⊂ T_{N-1}.

The interesting case is when all the T_k are equal (to T_{N-1}, the set where it is better to stop than to go one step and stop). This can be shown to be true if

f(x, u, w) ∈ T_{N-1}, for all x ∈ T_{N-1}, u ∈ U(x), and w.

  • Slide 59/261

    SCHEDULING PROBLEMS

Set of tasks to perform; the ordering is subject to optimal choice.

Costs depend on the order

There may be stochastic uncertainty, and precedence and resource availability constraints

Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)

Some special problems admit a simple quasi-analytical solution method:

The optimal policy has an index form, i.e., each task has an easily calculable index, and it is optimal to select the task that has the maximum value of index (multi-armed bandit problems, to be discussed later)

Some problems can be solved by an interchange argument (start with some schedule, interchange two adjacent tasks, and see what happens)

  • Slide 60/261

    EXAMPLE: THE QUIZ PROBLEM

Given a list of N questions. If question i is answered correctly (with given probability p_i), we receive reward R_i; if not, the quiz terminates. Choose the order of questions to maximize expected reward.

Let i and j be the kth and (k+1)st questions in an optimally ordered list

L = (i_0, ..., i_{k-1}, i, j, i_{k+2}, ..., i_{N-1})

E{reward of L} = E{ reward of {i_0, ..., i_{k-1}} } + p_{i_0} · · · p_{i_{k-1}} ( p_i R_i + p_i p_j R_j ) + p_{i_0} · · · p_{i_{k-1}} p_i p_j E{ reward of {i_{k+2}, ..., i_{N-1}} }

Consider the list with i and j interchanged:

L' = (i_0, ..., i_{k-1}, j, i, i_{k+2}, ..., i_{N-1})

Since L is optimal, E{reward of L} ≥ E{reward of L'}, so it follows that p_i R_i + p_i p_j R_j ≥ p_j R_j + p_j p_i R_i, or

p_i R_i / (1 - p_i) ≥ p_j R_j / (1 - p_j).
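The resulting index rule (order the questions in decreasing p_i R_i/(1 - p_i)) is easy to check against brute force on a small instance. A sketch with made-up data (assuming every p_i < 1):

```python
from itertools import permutations

def expected_reward(order, p, R):
    """E{reward}: question i pays R[i] w.p. p[i]; a miss ends the quiz."""
    total, alive = 0.0, 1.0
    for i in order:
        total += alive * p[i] * R[i]      # reach question i w.p. `alive`
        alive *= p[i]
    return total

def index_order(p, R):
    """Interchange-argument rule: decreasing p_i R_i / (1 - p_i)."""
    return sorted(range(len(p)), key=lambda i: -p[i] * R[i] / (1 - p[i]))

p, R = [0.9, 0.5, 0.7], [1.0, 5.0, 2.0]   # hypothetical data
best = max(permutations(range(len(p))), key=lambda o: expected_reward(o, p, R))
assert abs(expected_reward(best, p, R)
           - expected_reward(index_order(p, R), p, R)) < 1e-12
```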

  • Slide 61/261

    MINIMAX CONTROL

Consider the basic problem with the difference that the disturbance w_k, instead of being random, is just known to belong to a given set W_k(x_k, u_k).

Find a policy π that minimizes the cost

J_π(x_0) = max_{w_k∈W_k(x_k, μ_k(x_k)), k=0,1,...,N-1} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) ]

The DP algorithm takes the form

J_N(x_N) = g_N(x_N),

J_k(x_k) = min_{u_k∈U(x_k)} max_{w_k∈W_k(x_k, u_k)} [ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]

(Exercise 1.5 in the text; solution posted on the web).

  • Slide 62/261

    UNKNOWN-BUT-BOUNDED CONTROL

For each k, keep the state x_k of the controlled system

x_{k+1} = f_k(x_k, μ_k(x_k), w_k)

inside a given set X_k, the target set at time k.

This is a minimax control problem, where the cost at stage k is

g_k(x_k) = 0 if x_k ∈ X_k, and 1 if x_k ∉ X_k.

We must reach at time k the set

X̄_k = { x_k | J_k(x_k) = 0 }

in order to be able to maintain the state within the subsequent target sets. Start with X̄_N = X_N, and for k = 0, 1, ..., N-1,

X̄_k = { x_k ∈ X_k | there exists u_k ∈ U_k(x_k) such that f_k(x_k, u_k, w_k) ∈ X̄_{k+1}, for all w_k ∈ W_k(x_k, u_k) }

  • Slide 63/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 7

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

Examples

Connection with the calculus of variations

The Hamilton-Jacobi-Bellman equation as a continuous-time limit of the DP algorithm

The Hamilton-Jacobi-Bellman equation as a sufficient condition

    Examples

  • Slide 64/261

    PROBLEM FORMULATION

We have a continuous-time dynamic system

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given,

where

x(t) ∈ ℝ^n is the state vector at time t

u(t) ∈ U ⊂ ℝ^m is the control vector at time t, U is the control constraint set

T is the terminal time.

Any admissible control trajectory {u(t) | t ∈ [0, T]} (piecewise continuous function {u(t) | t ∈ [0, T]} with u(t) ∈ U for all t ∈ [0, T]) uniquely determines {x(t) | t ∈ [0, T]}.

Find an admissible control trajectory {u(t) | t ∈ [0, T]} and corresponding state trajectory {x(t) | t ∈ [0, T]} that minimize a cost function of the form

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

f, h, g are assumed continuously differentiable.

  • Slide 65/261

    EXAMPLE I

Motion control: A unit mass moves on a line under the influence of a force u.

x(t) = (x_1(t), x_2(t)): position and velocity of the mass at time t

Problem: From a given (x_1(0), x_2(0)), bring the mass near a given final position-velocity pair (x̄_1, x̄_2) at time T, in the sense:

minimize |x_1(T) - x̄_1|² + |x_2(T) - x̄_2|²

subject to the control constraint

|u(t)| ≤ 1, for all t ∈ [0, T].

The problem fits the framework with

ẋ_1(t) = x_2(t), ẋ_2(t) = u(t),

h(x(T)) = |x_1(T) - x̄_1|² + |x_2(T) - x̄_2|²,

g(x(t), u(t)) = 0, for all t ∈ [0, T].

  • Slide 66/261

    EXAMPLE II

A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 - u(t) to production of a storable good. Thus x(t) evolves according to

ẋ(t) = γ u(t) x(t),

where γ > 0 is a given constant.

The producer wants to maximize the total amount of product stored

∫_0^T (1 - u(t)) x(t) dt

subject to

0 ≤ u(t) ≤ 1, for all t ∈ [0, T].

The initial production rate x(0) is a given positive number.

  • Slide 67/261

    EXAMPLE III (CALCULUS OF VARIATIONS)

[Figure: a curve x(t) from a given point (0, α) to a given vertical line at t = T, with length ∫_0^T √(1 + (u(t))²) dt and ẋ(t) = u(t).]

Find a curve from a given point to a given line that has minimum length.

The problem is

minimize ∫_0^T √(1 + (ẋ(t))²) dt

subject to x(0) = α.

Reformulation as an optimal control problem:

minimize ∫_0^T √(1 + (u(t))²) dt

subject to ẋ(t) = u(t), x(0) = α.

  • Slide 68/261

    HAMILTON-JACOBI-BELLMAN EQUATION I

We discretize [0, T] at times 0, δ, 2δ, ..., Nδ, where δ = T/N, and we let

x_k = x(kδ), u_k = u(kδ), k = 0, 1, ..., N.

We also discretize the system and cost:

x_{k+1} = x_k + f(x_k, u_k) · δ, h(x_N) + Σ_{k=0}^{N-1} g(x_k, u_k) · δ.

We write the DP algorithm for the discretized problem:

J̃(Nδ, x) = h(x),

J̃(kδ, x) = min_{u∈U} [ g(x, u) · δ + J̃((k+1) · δ, x + f(x, u) · δ) ]

Assume J̃ is differentiable and Taylor-expand:

J̃(kδ, x) = min_{u∈U} [ g(x, u) · δ + J̃(kδ, x) + ∇_t J̃(kδ, x) · δ + ∇_x J̃(kδ, x)' f(x, u) · δ + o(δ) ].

  • Slide 69/261

    HAMILTON-JACOBI-BELLMAN EQUATION II

Let J*(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,

lim_{k→∞, δ→0, kδ=t} J̃(kδ, x) = J*(t, x), for all t, x,

we obtain, for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

This is the Hamilton-Jacobi-Bellman (HJB) equation, a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J*(t, x) (assuming J* is differentiable and the preceding informal limiting procedure is valid).

It is hard to tell a priori if J*(t, x) is differentiable. So we use the HJB equation as a verification tool; if we can solve it for a differentiable J*(t, x), then:

J* is the optimal cost-to-go function

The control μ*(t, x) that minimizes in the RHS for each (t, x) defines an optimal control

  • Slide 70/261

    VERIFICATION/SUFFICIENCY THEOREM

Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that, for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇_t V(t, x) + ∇_x V(t, x)' f(x, u) ],

V(T, x) = h(x), for all x.

Suppose also that μ*(t, x) attains the minimum above for all t and x.

Let {x*(t) | t ∈ [0, T]} and u*(t) = μ*(t, x*(t)), t ∈ [0, T], be the corresponding state and control trajectories.

Then

V(t, x) = J*(t, x), for all t, x,

and {u*(t) | t ∈ [0, T]} is optimal.

  • Slide 71/261

    PROOF

Let {(û(t), x̂(t)) | t ∈ [0, T]} be any admissible control-state trajectory. We have, for all t ∈ [0, T],

0 ≤ g(x̂(t), û(t)) + ∇_t V(t, x̂(t)) + ∇_x V(t, x̂(t))' f(x̂(t), û(t))

Using the system equation (d/dt) x̂(t) = f(x̂(t), û(t)), the RHS of the above is equal to

g(x̂(t), û(t)) + (d/dt) V(t, x̂(t))

Integrating this expression over t ∈ [0, T],

0 ≤ ∫_0^T g(x̂(t), û(t)) dt + V(T, x̂(T)) - V(0, x̂(0)).

Using V(T, x) = h(x) and x̂(0) = x(0), we have

V(0, x(0)) ≤ h(x̂(T)) + ∫_0^T g(x̂(t), û(t)) dt.

If we use u*(t) and x*(t) in place of û(t) and x̂(t), the inequalities become equalities, and

V(0, x(0)) = h(x*(T)) + ∫_0^T g(x*(t), u*(t)) dt.

  • Slide 72/261

    EXAMPLE OF THE HJB EQUATION

Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)(x(T))². The HJB equation is

0 = min_{|u|≤1} [ ∇_t V(t, x) + ∇_x V(t, x) u ], for all t, x,

with the terminal condition V(T, x) = (1/2)x².

Evident candidate for optimality: μ*(t, x) = -sgn(x). Corresponding cost-to-go:

J*(t, x) = (1/2) ( max{0, |x| - (T - t)} )².

We verify that J* solves the HJB equation, and that u = -sgn(x) attains the min in the RHS. Indeed,

∇_t J*(t, x) = max{0, |x| - (T - t)},

∇_x J*(t, x) = sgn(x) max{0, |x| - (T - t)}.

Substituting, the HJB equation becomes

0 = min_{|u|≤1} [ 1 + sgn(x) u ] max{0, |x| - (T - t)}

  • Slide 73/261

    LINEAR QUADRATIC PROBLEM

Consider the n-dimensional linear system

ẋ(t) = A x(t) + B u(t),

and the quadratic cost

x(T)' Q_T x(T) + ∫_0^T ( x(t)' Q x(t) + u(t)' R u(t) ) dt

The HJB equation is

0 = min_{u∈ℝ^m} [ x'Qx + u'Ru + ∇_t V(t, x) + ∇_x V(t, x)' (Ax + Bu) ],

with the terminal condition V(T, x) = x' Q_T x. We try a solution of the form

V(t, x) = x' K(t) x, K(t): n × n symmetric,

and show that V(t, x) solves the HJB equation if

K̇(t) = -K(t)A - A'K(t) + K(t)BR^{-1}B'K(t) - Q

with the terminal condition K(T) = Q_T.
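The matrix Riccati ODE can be integrated backward from K(T) = Q_T. A minimal sketch using explicit Euler (my illustration; the step count and the integration scheme are arbitrary choices):

```python
import numpy as np

def riccati_ode_backward(A, B, Q, R, QT, T, steps=1000):
    """Integrate dK/dt = -KA - A'K + K B R^{-1} B' K - Q
    backward from K(T) = Q_T; returns an approximation of K(0)."""
    K = QT.astype(float).copy()
    dt = T / steps
    Rinv = np.linalg.inv(R)
    for _ in range(steps):
        Kdot = -K @ A - A.T @ K + K @ B @ Rinv @ B.T @ K - Q
        K = K - dt * Kdot                 # step backward in time
    return K
```

Minimizing the RHS of the HJB equation then gives the feedback u(t) = -R^{-1} B' K(t) x(t).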

  • Slide 74/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 8

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

From the HJB equation to the Pontryagin Minimum Principle

    Examples

  • Slide 75/261

    THE HJB EQUATION

Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given

Cost function:

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

J*(t, x): optimal cost-to-go from x at time t

HJB equation: For all (t, x),

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

Verification theorem: If we can find a solution, it must be equal to the optimal cost-to-go function.

Also, a (closed-loop) policy μ*(t, x) such that μ*(t, x) attains the min for each (t, x) is optimal.

  • Slide 76/261

    HJB EQ. ALONG AN OPTIMAL TRAJECTORY

Observation I: An optimal control-state trajectory pair {(u*(t), x*(t)) | t ∈ [0, T]} satisfies, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} [ g(x*(t), u) + ∇_x J*(t, x*(t))' f(x*(t), u) ]   (*)

Observation II: To obtain an optimal control trajectory {u*(t) | t ∈ [0, T]} via this equation, we don't need to know ∇_x J*(t, x) for all (t, x), only the time function

p(t) = ∇_x J*(t, x*(t)), t ∈ [0, T].

It turns out that calculating p(t) is often easier than calculating J*(t, x) or ∇_x J*(t, x) for all (t, x).

Pontryagin's minimum principle is just Eq. (*) together with an equation for calculating p(t), called the adjoint equation.

Also, Pontryagin's minimum principle is valid much more generally, even in cases where J*(t, x) is not differentiable and the HJB equation has no solution.

  • Slide 77/261

    DERIVING THE ADJOINT EQUATION

The HJB equation holds as an identity for all (t, x), so it can be differentiated [the gradient of the RHS with respect to (t, x) is identically 0].

We need a tool for differentiation of minimum functions.

Lemma: Let F(t, x, u) be a continuously differentiable function of t ∈ ℝ, x ∈ ℝ^n, and u ∈ ℝ^m, and let U be a convex subset of ℝ^m. Assume that μ*(t, x) is a continuously differentiable function such that

μ*(t, x) = arg min_{u∈U} F(t, x, u), for all t, x.

Then

∇_t { min_{u∈U} F(t, x, u) } = ∇_t F(t, x, μ*(t, x)), for all t, x,

∇_x { min_{u∈U} F(t, x, u) } = ∇_x F(t, x, μ*(t, x)), for all t, x.

  • Slide 78/261

    DIFFERENTIATING THE HJB EQUATION I

We set to zero the gradient with respect to x and t of the function

g(x, μ*(t, x)) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, μ*(t, x)),

and we rely on the Lemma to disregard the terms involving the derivatives of μ*(t, x) with respect to t and x.

We obtain, for all (t, x),

0 = ∇_x g(x, μ*(t, x)) + ∇²_xt J*(t, x) + ∇²_xx J*(t, x) f(x, μ*(t, x)) + ∇_x f(x, μ*(t, x)) ∇_x J*(t, x),

0 = ∇²_tt J*(t, x) + ∇²_xt J*(t, x)' f(x, μ*(t, x)),

where ∇_x f(x, μ*(t, x)) is the matrix

∇_x f = [ ∂f_1/∂x_1 · · · ∂f_n/∂x_1 ; · · · ; ∂f_1/∂x_n · · · ∂f_n/∂x_n ]

  • Slide 79/261

    DIFFERENTIATING THE HJB EQUATION II

The preceding equations hold for all (t, x). We specialize them along an optimal state and control trajectory {(x*(t), u*(t)) | t ∈ [0, T]}, where u*(t) = μ*(t, x*(t)) for all t ∈ [0, T].

We have ẋ*(t) = f(x*(t), u*(t)), so the terms

∇²_xt J*(t, x*(t)) + ∇²_xx J*(t, x*(t)) f(x*(t), u*(t)),

∇²_tt J*(t, x*(t)) + ∇²_xt J*(t, x*(t))' f(x*(t), u*(t))

are equal to the total derivatives

(d/dt) ∇_x J*(t, x*(t)), (d/dt) ∇_t J*(t, x*(t)),

and we have

0 = ∇_x g(x*(t), u*(t)) + (d/dt) ∇_x J*(t, x*(t)) + ∇_x f(x*(t), u*(t)) ∇_x J*(t, x*(t)),

0 = (d/dt) ∇_t J*(t, x*(t)).

  • Slide 80/261

    CONCLUSION FROM DIFFERENTIATING THE HJB

Define

p(t) = ∇_x J*(t, x*(t)) and p_0(t) = ∇_t J*(t, x*(t))

We have the adjoint equation

ṗ(t) = -∇_x f(x*(t), u*(t)) p(t) - ∇_x g(x*(t), u*(t))

and

ṗ_0(t) = 0,

or equivalently,

p_0(t) = constant, for all t ∈ [0, T].

Note also that, by definition, J*(T, x*(T)) = h(x*(T)), so we have the following boundary condition at the terminal time:

p(T) = ∇h(x*(T))

  • Slide 81/261

    NOTATIONAL SIMPLIFICATION

Define the Hamiltonian function

H(x, u, p) = g(x, u) + p' f(x, u)

The adjoint equation becomes

ṗ(t) = -∇_x H(x*(t), u*(t), p(t))

The HJB equation becomes

0 = min_{u∈U} H(x*(t), u, p(t)) + p_0(t) = H(x*(t), u*(t), p(t)) + p_0(t),

so since p_0(t) = constant, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 82/261

    PONTRYAGIN MINIMUM PRINCIPLE

The preceding (highly informal) derivation is summarized as follows:

Minimum Principle: Let {u*(t) | t ∈ [0, T]} be an optimal control trajectory and let {x*(t) | t ∈ [0, T]} be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with the boundary condition

p(T) = ∇h(x*(T)).

Then, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

Furthermore, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 83/261

    2-POINT BOUNDARY PROBLEM VIEW

The minimum principle is a necessary condition for optimality and can be used to identify candidates for optimality.

We need to solve for x*(t) and p(t) the differential equations

ẋ*(t) = f(x*(t), u*(t)),

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with split boundary conditions:

x*(0): given, p(T) = ∇h(x*(T)).

The control trajectory is implicitly determined from x*(t) and p(t) via the equation

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

This 2-point boundary value problem can be addressed with a variety of numerical methods.
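One such method is simple shooting: guess p(0), integrate the coupled state/costate equations forward, and adjust the guess until p(T) = ∇h(x(T)) holds. A rough sketch of the forward pass (my illustration; f, dH_dx, and the Hamiltonian minimizer argmin_H are assumed supplied, and Euler integration is used for brevity):

```python
import numpy as np

def shoot(p0, x0, f, dH_dx, argmin_H, T, steps=200):
    """Integrate xdot = f(x, u*), pdot = -grad_x H forward from a guessed
    p(0); a root-finder on p0 -> p(T) - grad h(x(T)) wraps this."""
    dt = T / steps
    x, p = np.asarray(x0, float), np.asarray(p0, float)
    for _ in range(steps):
        u = argmin_H(x, p)                # u*(t) minimizes the Hamiltonian
        x = x + dt * f(x, u)
        p = p - dt * dH_dx(x, u, p)
    return x, p
```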

  • Slide 84/261

    ANALYTICAL EXAMPLE I

minimize ∫_0^T √(1 + (u(t))²) dt

subject to

ẋ(t) = u(t), x(0) = α.

The Hamiltonian is

H(x, u, p) = √(1 + u²) + pu,

and the adjoint equation is ṗ(t) = 0 with p(T) = 0.

Hence, p(t) = 0 for all t ∈ [0, T], so minimization of the Hamiltonian gives

u*(t) = arg min_u √(1 + u²) = 0, for all t ∈ [0, T].

Therefore, ẋ*(t) = 0 for all t, implying that x*(t) is constant. Using the initial condition x*(0) = α, it follows that x*(t) = α for all t.

  • Slide 85/261

    ANALYTICAL EXAMPLE II

Optimal production problem:

maximize ∫_0^T (1 - u(t)) x(t) dt

subject to 0 ≤ u(t) ≤ 1 for all t, and

ẋ(t) = γ u(t) x(t), x(0) > 0: given.

Hamiltonian: H(x, u, p) = (1 - u)x + p γ u x.

The adjoint equation is

ṗ(t) = -γ u*(t) p(t) - 1 + u*(t), p(T) = 0.

Maximization of the Hamiltonian over u ∈ [0, 1]:

u*(t) = 0 if p(t) < 1/γ, and 1 if p(t) ≥ 1/γ.

Since p(T) = 0, for t close to T we have p(t) < 1/γ and u*(t) = 0. Therefore, for t near T, the adjoint equation has the form ṗ(t) = -1.

  • Slide 86/261

    ANALYTICAL EXAMPLE II (CONTINUED)

[Figure: p(t) increasing linearly (backward from p(T) = 0) and reaching 1/γ at t = T - 1/γ.]

At t = T - 1/γ, p(t) is equal to 1/γ, so u*(t) changes to u*(t) = 1.

Geometrical construction:

[Figure: p(t) over [0, T] together with the optimal control, u*(t) = 1 for t < T - 1/γ and u*(t) = 0 for t > T - 1/γ.]

  • Slide 87/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 9

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

Variants of the Pontryagin Minimum Principle

Fixed terminal state

Free terminal time

    Examples

    Discrete-Time Minimum Principle

  • Slide 88/261

    REVIEW

Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given

Cost function:

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

J*(t, x): optimal cost-to-go from x at time t

HJB equation/verification theorem: For all (t, x),

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

Adjoint equation/vector: To compute an optimal state-control trajectory {(u*(t), x*(t))}, it is enough to know

p(t) = ∇_x J*(t, x*(t)), t ∈ [0, T].

The Pontryagin theorem gives an equation for p(t).

  • Slide 89/261

    NEC. CONDITION: PONTRYAGIN MIN. PRINCIPLE

Define the Hamiltonian function

H(x, u, p) = g(x, u) + p' f(x, u).

Minimum Principle: Let {u*(t) | t ∈ [0, T]} be an optimal control trajectory and let {x*(t) | t ∈ [0, T]} be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with the boundary condition

p(T) = ∇h(x*(T)).

Then, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

Furthermore, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 90/261

    VARIATIONS: FIXED TERMINAL STATE

Suppose that, in addition to the initial state x(0), the final state x(T) is given.

Then the informal derivation of the adjoint equation still holds, but the terminal condition J*(T, x) ≡ h(x) of the HJB equation is not true anymore.

In effect,

J*(T, x) = 0 if x = x(T), and ∞ otherwise.

So J*(T, x) cannot be differentiated with respect to x, and the terminal boundary condition p(T) = ∇h(x*(T)) for the adjoint equation does not hold.

As compensation, we have the extra condition

x(T): given,

thus maintaining the balance between boundary conditions and unknowns.

Generalization: Some components of the terminal state are fixed.

  • Slide 91/261

    EXAMPLE WITH FIXED TERMINAL STATE

Consider finding the curve of minimum length connecting two points (0, α) and (T, β). We have

ẋ(t) = u(t), x(0) = α, x(T) = β,

and the cost is ∫_0^T √(1 + (u(t))²) dt.

[Figure: the optimal trajectory x*(t), a straight line from (0, α) to (T, β).]

The adjoint equation is ṗ(t) = 0, implying that

p(t) = constant, for all t ∈ [0, T].

Minimizing the Hamiltonian √(1 + u²) + p(t)u:

u*(t) = constant, for all t ∈ [0, T].

So the optimal {x*(t) | t ∈ [0, T]} is a straight line.

  • Slide 92/261

    VARIATIONS: FREE TERMINAL TIME

The initial state and/or the terminal state are given, but the terminal time T is subject to optimization.

Let {(x*(t), u*(t)) | t ∈ [0, T]} be an optimal state-control trajectory pair and let T* be the optimal terminal time. Then x*(t), u*(t) would still be optimal if T were fixed at T*, so

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)), for all t ∈ [0, T*],

where p(t) is given by the adjoint equation.

In addition: H(x*(t), u*(t), p(t)) = 0 for all t [instead of H(x*(t), u*(t), p(t)) ≡ constant].

Justification: We have

∇_t J*(t, x*(t)) |_{t=0} = 0

Along the optimal trajectory, the HJB equation gives

∇_t J*(t, x*(t)) = -H(x*(t), u*(t), p(t)), for all t,

so H(x*(0), u*(0), p(0)) = 0.

  • Slide 93/261

    MINIMUM-TIME EXAMPLE I

A unit mass moves horizontally: ÿ(t) = u(t), where y(t): position, u(t): force, u(t) ∈ [-1, 1].

Given the initial position-velocity pair (y(0), ẏ(0)), bring the object to (y(T), ẏ(T)) = (0, 0) so that the time of transfer is minimum. Thus, we want to

minimize T = ∫_0^T 1 dt.

Let the state variables be

x_1(t) = y(t), x_2(t) = ẏ(t),

so the system equation is

ẋ_1(t) = x_2(t), ẋ_2(t) = u(t).

Initial state (x_1(0), x_2(0)): given, and

x_1(T) = 0, x_2(T) = 0.

  • Slide 94/261

    MINIMUM-TIME EXAMPLE II

If {u*(t) | t ∈ [0, T]} is optimal, u*(t) must minimize the Hamiltonian for each t, i.e.,

u*(t) = arg min_{-1≤u≤1} [ 1 + p_1(t) x_2(t) + p_2(t) u ].

Therefore

u*(t) = 1 if p_2(t) < 0, and u*(t) = -1 if p_2(t) > 0

  • Slide 95/261

  • Slide 96/261

    MINIMUM-TIME EXAMPLE IV

For intervals where u*(t) ≡ 1, the system moves along the curves on which

x_1(t) - (1/2)(x_2(t))² is constant.

For intervals where u*(t) ≡ -1, the system moves along the curves on which

x_1(t) + (1/2)(x_2(t))² is constant.

[Figure: (a) trajectories in the (x_1, x_2) plane for u(t) ≡ 1; (b) trajectories for u(t) ≡ -1.]

  • Slide 97/261

    MINIMUM-TIME EXAMPLE V

To bring the system from the initial state x(0) to the origin with at most one switch, we use the following switching curve.

[Figure: switching curve through the origin in the (x_1, x_2) plane, with u*(t) ≡ 1 used below the curve and u*(t) ≡ -1 above it.]

(a) If the initial state lies above the switching curve, use u*(t) ≡ -1 until the state hits the switching curve; then use u*(t) ≡ 1.

(b) If the initial state lies below the switching curve, use u*(t) ≡ 1 until the state hits the switching curve; then use u*(t) ≡ -1.

(c) If the initial state lies on the top (bottom) part of the switching curve, use u*(t) ≡ -1 [u*(t) ≡ 1, respectively].

  • Slide 98/261

    DISCRETE-TIME MINIMUM PRINCIPLE

Minimize J(u) = g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k), subject to u_k ∈ U_k ⊂ ℝ^m, with U_k: convex, and

x_{k+1} = f_k(x_k, u_k), k = 0, ..., N-1, x_0: given.

Introduce the Hamiltonian function

H_k(x_k, u_k, p_{k+1}) = g_k(x_k, u_k) + p_{k+1}' f_k(x_k, u_k)

Suppose {(u*_k, x*_{k+1}) | k = 0, ..., N-1} are optimal. Then for all k,

∇_{u_k} H_k(x*_k, u*_k, p_{k+1})' (u_k - u*_k) ≥ 0, for all u_k ∈ U_k,

where p_1, ..., p_N are obtained from

p_k = ∇_{x_k} f_k p_{k+1} + ∇_{x_k} g_k,

with the terminal condition p_N = ∇g_N(x_N).

If, in addition, the Hamiltonian H_k is a convex function of u_k for any fixed x_k and p_{k+1}, we have

u*_k = arg min_{u_k∈U_k} H_k(x*_k, u_k, p_{k+1}), for all k.
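The adjoint recursion gives an O(N) way to evaluate all partial gradients ∇_{u_k} J, which is the content of the derivation on the next slide. A sketch (my illustration; fx and fu denote the standard Jacobians ∂f/∂x and ∂f/∂u, and gx, gu, gN_grad the corresponding gradients):

```python
import numpy as np

def gradient_via_adjoint(us, x0, f, fx, fu, gx, gu, gN_grad):
    """Forward pass for x_k, then p_k = fx' p_{k+1} + gx with
    p_N = grad g_N(x_N); finally grad_{u_k} J = fu' p_{k+1} + gu."""
    N = len(us)
    xs = [np.asarray(x0, float)]
    for k in range(N):                    # forward state pass
        xs.append(f(xs[k], us[k]))
    p = gN_grad(xs[N])                    # terminal condition
    grads = [None] * N
    for k in range(N - 1, -1, -1):        # backward costate pass
        grads[k] = fu(xs[k], us[k]).T @ p + gu(xs[k], us[k])
        p = fx(xs[k], us[k]).T @ p + gx(xs[k], us[k])
    return grads
```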

  • Slide 99/261

    DERIVATION

We develop an expression for the gradient ∇J(u). We have, using the chain rule,

∇_{u_k} J(u) = ∇_{u_k} f_k · ∇_{x_{k+1}} f_{k+1} · · · ∇_{x_{N-1}} f_{N-1} · ∇g_N
+ ∇_{u_k} f_k · ∇_{x_{k+1}} f_{k+1} · · · ∇_{x_{N-2}} f_{N-2} · ∇_{x_{N-1}} g_{N-1}
+ · · ·
+ ∇_{u_k} f_k · ∇_{x_{k+1}} g_{k+1}
+ ∇_{u_k} g_k,

where all gradients are evaluated along u and the corresponding state trajectory.

Introduce the discrete-time adjoint equation

p_k = ∇_{x_k} f_k p_{k+1} + ∇_{x_k} g_k, k = 1, ..., N-1,

with terminal condition p_N = ∇g_N.

Verify that, for all k,

∇_{u_k} J(u_0, ..., u_{N-1}) = ∇_{u_k} H_k(x_k, u_k, p_{k+1})

  • Slide 100/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 10

    LECTURE OUTLINE

    Problems with imperfect state info

Reduction to the perfect state info case

Machine repair example

  • Slide 101/261

    BASIC PROBLEM WITH IMPERFECT STATE INFO

Same as the basic problem of Chapter 1 with one difference: the controller, instead of knowing x_k, receives at each time k an observation of the form

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

The observation z_k belongs to some space Z_k. The random observation disturbance v_k is characterized by a probability distribution

P_{v_k}(· | x_k, ..., x_0, u_{k-1}, ..., u_0, w_{k-1}, ..., w_0, v_{k-1}, ..., v_0)

The initial state x_0 is also random and characterized by a probability distribution P_{x_0}.

The probability distribution P_{w_k}(· | x_k, u_k) of w_k is given, and it may depend explicitly on x_k and u_k but not on w_0, ..., w_{k-1}, v_0, ..., v_{k-1}.

The control u_k is constrained to a given subset U_k (this subset does not depend on x_k, which is not assumed known).

  • Slide 102/261

    INFORMATION VECTOR AND POLICIES

Denote by I_k the information vector, i.e., the information available at time k:

I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1, I_0 = z_0.

We consider policies π = {μ_0, μ_1, ..., μ_{N-1}}, where each function μ_k maps the information vector I_k into a control u_k and

μ_k(I_k) ∈ U_k, for all I_k, k ≥ 0.

We want to find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1

  • Slide 103/261

    EXAMPLE: MULTIACCESS COMMUNICATION I

A collection of transmitting stations sharing a common channel are synchronized to transmit packets of data at integer times.

x_k: backlog at the beginning of slot k

a_k: random number of packet arrivals in slot k

t_k: number of packets transmitted in slot k

x_{k+1} = x_k + a_k - t_k

At the kth slot, each of the x_k packets in the system is transmitted with probability u_k (common for all packets). If two or more packets are transmitted simultaneously, they collide.

So t_k = 1 (a success) with probability x_k u_k (1 - u_k)^{x_k - 1}, and t_k = 0 (idle or collision) otherwise.

Imperfect state info: The stations can observe the channel and determine whether in any one slot there was a collision (two or more packets), a success (one packet), or an idle (no packets).

  • Slide 104/261

    EXAMPLE: MULTIACCESS COMMUNICATION II

Information vector at time k: the entire history (up to k) of successes, idles, and collisions (as well as u_0, u_1, ..., u_{k-1}). Mathematically, z_{k+1}, the observation at the end of the kth slot, is

z_{k+1} = v_{k+1},

where v_{k+1} yields an idle with probability (1 - u_k)^{x_k}, a success with probability x_k u_k (1 - u_k)^{x_k - 1}, and a collision otherwise.

If we had perfect state information, the DP algorithm would be

J_k(x_k) = g_k(x_k) + min_{0≤u_k≤1} E_{a_k} { p(x_k, u_k) J_{k+1}(x_k + a_k - 1) + ( 1 - p(x_k, u_k) ) J_{k+1}(x_k + a_k) },

where p(x_k, u_k) is the success probability x_k u_k (1 - u_k)^{x_k - 1}.

The optimal (perfect state information) policy would be to select the value of u_k that maximizes p(x_k, u_k), so

μ_k(x_k) = 1/x_k, for all x_k ≥ 1.

  • Slide 105/261

    REFORMULATION AS A PERFECT INFO PROBLEM

We have

I_{k+1} = (I_k, z_{k+1}, u_k), k = 0, 1, ..., N-2, I_0 = z_0.

View this as a dynamic system with state I_k, control u_k, and random disturbance z_{k+1}.

We have

P(z_{k+1} | I_k, u_k) = P(z_{k+1} | I_k, u_k, z_0, z_1, ..., z_k),

since z_0, z_1, ..., z_k are part of the information vector I_k. Thus the probability distribution of z_{k+1} depends explicitly only on the state I_k and control u_k, and not on the prior disturbances z_k, ..., z_0.

Write

E{ g_k(x_k, u_k, w_k) } = E{ E_{x_k, w_k}{ g_k(x_k, u_k, w_k) | I_k, u_k } },

so the cost per stage of the new system is

g̃_k(I_k, u_k) = E_{x_k, w_k}{ g_k(x_k, u_k, w_k) | I_k, u_k }

  • Slide 106/261

    DP ALGORITHM

Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

J_k(I_k) = min_{u_k∈U_k} E_{x_k, w_k, z_{k+1}} { g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k }

for k = 0, 1, ..., N-2, and for k = N-1,

J_{N-1}(I_{N-1}) = min_{u_{N-1}∈U_{N-1}} E_{x_{N-1}, w_{N-1}} { g_N(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} }

The optimal cost J* is given by

J* = E_{z_0}{ J_0(z_0) }.

  • Slide 107/261

    MACHINE REPAIR EXAMPLE I

A machine can be in one of two states, denoted P (good state) and P̄ (bad state).

At the end of each period the machine is inspected.

Two possible inspection outcomes: G (probably good state) and B (probably bad state).

[Figure: state transition and inspection diagram. From P the machine stays in P with probability 2/3 and moves to P̄ with probability 1/3; P̄ is absorbing. Inspection yields G with probability 3/4 in state P, and B with probability 3/4 in state P̄.]

Possible actions after each inspection:

C: Continue operation of the machine.

S: Stop the machine, determine its state, and if in P̄ bring it back to the good state P.

Cost per stage:

g(P, C) = 0, g(P, S) = 1, g(P̄, C) = 2, g(P̄, S) = 1

  • Slide 108/261

    MACHINE REPAIR EXAMPLE II

The information vector at times 0 and 1 is

I_0 = z_0, I_1 = (z_0, z_1, u_0),

and we seek functions μ_0(I_0), μ_1(I_1) that minimize

E_{x_0, w_0, w_1, v_0, v_1} { g(x_0, μ_0(z_0)) + g(x_1, μ_1(z_0, z_1, μ_0(z_0))) }.

DP algorithm: Start with J_2(I_2) = 0. For k = 0, 1, take the min over the two actions, C and S:

J_k(I_k) = min[ P(x_k = P | I_k) g(P, C) + P(x_k = P̄ | I_k) g(P̄, C) + E_{z_{k+1}}{ J_{k+1}(I_k, C, z_{k+1}) | I_k, C },

P(x_k = P | I_k) g(P, S) + P(x_k = P̄ | I_k) g(P̄, S) + E_{z_{k+1}}{ J_{k+1}(I_k, S, z_{k+1}) | I_k, S } ]

  • Slide 109/261

  • Slide 110/261

    MACHINE REPAIR EXAMPLE IV

(2) For I_1 = (B, G, S):

P(x_1 = P̄ | B, G, S) = P(x_1 = P̄ | G, G, S) = 1/7

J_1(B, G, S) = 2/7, μ*_1(B, G, S) = C.

(3) For I_1 = (G, B, S):

P(x_1 = P̄ | G, B, S) = P(x_1 = P̄, G, B, S) / P(G, B, S) = ( (1/3)(3/4) ) / ( (2/3)(1/4) + (1/3)(3/4) ) = 3/5,

J_1(G, B, S) = 1, μ*_1(G, B, S) = S.

Similarly, for all possible I_1, we compute J_1(I_1) and μ*_1(I_1), which is to continue (u_1 = C) if the last inspection was G, and to stop otherwise.

  • Slide 111/261

  • Slide 112/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 11

    LECTURE OUTLINE

    Review of DP for imperfect state info

Linear quadratic problems

Separation of estimation and control

  • Slide 113/261

REVIEW: PROBLEM WITH IMPERFECT STATE INFO

Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

I_k: information vector available at time k:

I_0 = z_0, I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1

Optimization over policies π = {μ_0, μ_1, ..., μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.

Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1

  • Slide 114/261

    DP ALGORITHM

Reformulate to a perfect state info problem, and write the DP algorithm:

J_k(I_k) = min_{u_k∈U_k} E_{x_k, w_k, z_{k+1}} { g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k }

for k = 0, 1, ..., N-2, and for k = N-1,

J_{N-1}(I_{N-1}) = min_{u_{N-1}∈U_{N-1}} E_{x_{N-1}, w_{N-1}} { g_N(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} }

The optimal cost J* is given by

J* = E_{z_0}{ J_0(z_0) }.

  • Slide 115/261

    LINEAR-QUADRATIC PROBLEMS

System: x_{k+1} = A_k x_k + B_k u_k + w_k

Quadratic cost:

E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + Σ_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }

where Q_k ≥ 0 and R_k > 0.

Observations:

z_k = C_k x_k + v_k, k = 0, 1, ..., N-1

w_0, ..., w_{N-1}, v_0, ..., v_{N-1}: independent, zero mean

Key fact to show:

Optimal policy {μ*_0, ..., μ*_{N-1}} is of the form

μ*_k(I_k) = L_k E{x_k | I_k}

L_k: same as for the perfect state info case

The estimation problem and the control problem can be solved separately

  • Slide 116/261

    DP ALGORITHM I

Last stage N-1 (suppressing the index N-1):

J_{N-1}(I_{N-1}) = min_{u_{N-1}} E_{x_{N-1}, w_{N-1}} { x_{N-1}' Q x_{N-1} + u_{N-1}' R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})' Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1}, u_{N-1} }

Since E{w_{N-1} | I_{N-1}} = E{w_{N-1}} = 0, the minimization involves

min_{u_{N-1}} [ u_{N-1}' (B'QB + R) u_{N-1} + 2 E{x_{N-1} | I_{N-1}}' A'QB u_{N-1} ]

The minimization yields the optimal μ*_{N-1}:

u*_{N-1} = μ*_{N-1}(I_{N-1}) = L_{N-1} E{x_{N-1} | I_{N-1}},

where

L_{N-1} = -(B'QB + R)^{-1} B'QA

  • Slide 117/261

    DP ALGORITHM II

Substituting in the DP algorithm:

J_{N-1}(I_{N-1}) = E_{x_{N-1}} { x_{N-1}' K_{N-1} x_{N-1} | I_{N-1} } + E_{x_{N-1}} { ( x_{N-1} - E{x_{N-1} | I_{N-1}} )' P_{N-1} ( x_{N-1} - E{x_{N-1} | I_{N-1}} ) | I_{N-1} } + E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} },

where the matrices K_{N-1} and P_{N-1} are given by

P_{N-1} = A_{N-1}' Q_N B_{N-1} (R_{N-1} + B_{N-1}' Q_N B_{N-1})^{-1} B_{N-1}' Q_N A_{N-1},

K_{N-1} = A_{N-1}' Q_N A_{N-1} - P_{N-1} + Q_{N-1}.

Note the structure of J_{N-1}: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

x_{N-1} - E{x_{N-1} | I_{N-1}}

  • Slide 118/261

    DP ALGORITHM III

DP equation for period N-2:

J_{N-2}(I_{N-2}) = min_{u_{N-2}} E_{x_{N-2}, w_{N-2}, z_{N-1}} { x_{N-2}' Q x_{N-2} + u_{N-2}' R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2}, u_{N-2} }

= E{ x_{N-2}' Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}' R u_{N-2} + E{ x_{N-1}' K_{N-1} x_{N-1} | I_{N-2}, u_{N-2} } ]

+ E{ ( x_{N-1} - E{x_{N-1} | I_{N-1}} )' P_{N-1} ( x_{N-1} - E{x_{N-1} | I_{N-1}} ) | I_{N-2}, u_{N-2} }

+ E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} }.

Key point: We have excluded the next-to-last term from the minimization with respect to u_{N-2}. This term turns out to be independent of u_{N-2}.

  • Slide 119/261

    QUALITY OF ESTIMATION LEMMA

For every k, there is a function M_k such that we have

x_k - E{x_k | I_k} = M_k(x_0, w_0, ..., w_{k-1}, v_0, ..., v_k),

independently of the policy being used.

The following simplified version of the lemma conveys the main idea.

Simplified Lemma: Let r, u, z be random variables such that r and u are independent, and let x = r + u. Then

x - E{x | z, u} = r - E{r | z}.

Proof: We have

x - E{x | z, u} = r + u - E{r + u | z, u} = r + u - E{r | z, u} - u = r - E{r | z, u} = r - E{r | z}.

  • Slide 120/261

    APPLYING THE QUALITY OF ESTIMATION LEMMA

Using the lemma,

x_{N-1} - E{x_{N-1} | I_{N-1}} = ξ_{N-1},

where

ξ_{N-1}: function of x_0, w_0, ..., w_{N-2}, v_0, ..., v_{N-1}

Since ξ_{N-1} is independent of u_{N-2}, the conditional expectation of ξ_{N-1}' P_{N-1} ξ_{N-1} satisfies

E{ ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2}, u_{N-2} } = E{ ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2} }

and is independent of u_{N-2}.

So minimization in the DP algorithm yields

u*_{N-2} = μ*_{N-2}(I_{N-2}) = L_{N-2} E{x_{N-2} | I_{N-2}}

  • Slide 121/261

    FINAL RESULT

Continuing similarly (using also the quality of estimation lemma):

μ*_k(I_k) = L_k E{x_k | I_k},

where L_k is the same as for perfect state info:

L_k = -(R_k + B_k' K_{k+1} B_k)^{-1} B_k' K_{k+1} A_k,

with K_k generated from K_N = Q_N, using

K_k = A_k' K_{k+1} A_k - P_k + Q_k,

P_k = A_k' K_{k+1} B_k (R_k + B_k' K_{k+1} B_k)^{-1} B_k' K_{k+1} A_k

[Figure: block diagram of the closed loop. System x_{k+1} = A_k x_k + B_k u_k + w_k with measurement z_k = C_k x_k + v_k; an estimator (fed by z_k and the delayed u_{k-1}) produces E{x_k | I_k}, which is multiplied by the gain L_k to give u_k.]
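In the Gaussian case the estimator is the Kalman filter. One step of the separated controller might be sketched as follows (my illustration; W and V denote the covariances of w_k and v_k, and L the gain from the Riccati recursion above):

```python
import numpy as np

def lqg_step(xhat, Sigma, z, u_prev, A, B, C, W, V, L):
    """Kalman filter time/measurement update of E{x_k | I_k},
    followed by the certainty-equivalent control u_k = L xhat."""
    xpred = A @ xhat + B @ u_prev         # time update with previous control
    Spred = A @ Sigma @ A.T + W
    S = C @ Spred @ C.T + V               # innovation covariance
    G = Spred @ C.T @ np.linalg.inv(S)    # Kalman gain
    xhat = xpred + G @ (z - C @ xpred)    # measurement update with z_k
    Sigma = (np.eye(len(xhat)) - G @ C) @ Spred
    return L @ xhat, xhat, Sigma
```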

  • Slide 122/261

    SEPARATION INTERPRETATION

The optimal controller can be decomposed into

(a) an estimator, which uses the data to generate the conditional expectation E{x_k | I_k};

(b) an actuator, which multiplies E{x_k | I_k} by the gain matrix L_k and applies the control input u_k = L_k E{x_k | I_k}.

Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error

E_x{ ||x - x̂||² | I } = E{ ||x||² | I } - 2 E{x | I}' x̂ + ||x̂||²,

is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).

The estimator portion of the optimal controller is optimal for the problem of estimating the state x_k assuming the control is not subject to choice.

The actuator portion is optimal for the control problem assuming perfect state information.

  • Slide 123/261

    STEADY STATE/IMPLEMENTATION ASPECTS

As N → ∞, the solution of the Riccati equation converges to a steady state and L_k → L.

If x_0, w_k, and v_k are Gaussian, E{x_k | I_k} is a linear function of I_k and is generated by a nice recursive algorithm, the Kalman filter.

The Kalman filter involves also a Riccati equation, so for N → ∞ and a stationary system, it also has a steady-state structure.

Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state. For non-Gaussian uncertainty, computing E{x_k | I_k} may be very difficult, so a suboptimal solution is typically used.

Most common suboptimal controller: Replace E{x_k | I_k} by the estimate produced by the Kalman filter (act as if x_0, w_k, and v_k are Gaussian).

It can be shown that this controller is optimal within the class of controllers that are linear functions of I_k.

  • Slide 124/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 12

    LECTURE OUTLINE

    DP for imperfect state info

Sufficient statistics

Conditional state distribution as a sufficient statistic

    Finite-state systems

    Examples

  • Slide 125/261

REVIEW: PROBLEM WITH IMPERFECT STATE INFO

Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

I_k: information vector available at time k:

I_0 = z_0, I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1

Optimization over policies π = {μ_0, μ_1, ..., μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.

Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1


    DP ALGORITHM

DP algorithm:

$$J_k(I_k) = \min_{u_k \in U_k} E_{x_k, w_k, z_{k+1}} \Big\{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) \,\Big|\, I_k, u_k \Big\}$$

for $k = 0, 1, \ldots, N-2$, and for $k = N-1$,

$$J_{N-1}(I_{N-1}) = \min_{u_{N-1} \in U_{N-1}} E_{x_{N-1}, w_{N-1}} \Big\{ g_N\big(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})\big) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) \,\Big|\, I_{N-1}, u_{N-1} \Big\}$$

The optimal cost $J^*$ is given by

$$J^* = E_{z_0}\big\{ J_0(z_0) \big\}.$$


    SUFFICIENT STATISTICS

Suppose that we can find a function $S_k(I_k)$ such that the right-hand side of the DP algorithm can be written in terms of some function $H_k$ as

$$\min_{u_k \in U_k} H_k\big(S_k(I_k), u_k\big).$$

Such a function $S_k$ is called a sufficient statistic.
An optimal policy obtained by the preceding minimization can be written as

$$\mu_k^*(I_k) = \overline{\mu}_k\big(S_k(I_k)\big),$$

where $\overline{\mu}_k$ is an appropriate function.
Example of a sufficient statistic: $S_k(I_k) = I_k$
Another important sufficient statistic:

$$S_k(I_k) = P_{x_k \mid I_k}$$


DP ALGORITHM IN TERMS OF $P_{x_k \mid I_k}$

It turns out that $P_{x_k \mid I_k}$ is generated recursively by a dynamic system (estimator) of the form

$$P_{x_{k+1} \mid I_{k+1}} = \Phi_k\big(P_{x_k \mid I_k}, u_k, z_{k+1}\big)$$

for a suitable function $\Phi_k$
The DP algorithm can be written as

$$\overline{J}_k\big(P_{x_k \mid I_k}\big) = \min_{u_k \in U_k} E_{x_k, w_k, z_{k+1}} \Big\{ g_k(x_k, u_k, w_k) + \overline{J}_{k+1}\big(\Phi_k(P_{x_k \mid I_k}, u_k, z_{k+1})\big) \,\Big|\, I_k, u_k \Big\}$$

[Figure: block diagram. The system $x_{k+1} = f_k(x_k, u_k, w_k)$ with measurement $z_k = h_k(x_k, u_{k-1}, v_k)$ feeds an estimator that recursively produces $P_{x_k \mid I_k}$; an actuator $\overline{\mu}_k$ maps $P_{x_k \mid I_k}$ to the control $u_k$, with a delay feeding $z_k$ and $u_{k-1}$ back to the estimator.]


    EXAMPLE: A SEARCH PROBLEM

At each period, decide to search or not search a site that may contain a treasure.
If we search and a treasure is present, we find it with prob. $\beta$ and remove it from the site.
Treasure's worth: $V$. Cost of search: $C$
States: treasure present & treasure not present
Each search can be viewed as an observation of the state
Denote

$$p_k: \text{ prob. of treasure present at the start of time } k,$$

with $p_0$ given.
$p_k$ evolves at time $k$ according to the equation

$$p_{k+1} = \begin{cases} p_k & \text{if not search,} \\ 0 & \text{if search and find treasure,} \\ \dfrac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} & \text{if search and no treasure.} \end{cases}$$


    SEARCH PROBLEM (CONTINUED)

DP algorithm:

$$\overline{J}_k(p_k) = \max\left[ 0,\; -C + p_k \beta V + (1 - p_k \beta)\, \overline{J}_{k+1}\!\left( \frac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} \right) \right],$$

with $\overline{J}_N(p_N) = 0$.
Can be shown by induction that the functions $\overline{J}_k$ satisfy

$$\overline{J}_k(p_k) = 0, \quad \text{for all } p_k \leq \frac{C}{\beta V}$$

Furthermore, it is optimal to search at period $k$ if and only if

$$p_k \beta V \geq C$$

(expected reward from the next search $\geq$ the cost of the search)
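A minimal numerical check of this recursion on a discretized belief grid, with illustrative values $\beta = 0.5$, $V = 10$, $C = 1$ (the grid and interpolation are implementation conveniences, not part of the slides):

```python
def search_dp(N=10, beta=0.5, V=10.0, C=1.0, grid=1001):
    ps = [i / (grid - 1) for i in range(grid)]
    J = [0.0] * grid                                  # J_N(p) = 0
    for _ in range(N):
        Jnew = []
        for p in ps:
            # belief after an unsuccessful search
            pn = p * (1 - beta) / (p * (1 - beta) + 1 - p)
            x = pn * (grid - 1)
            i = min(int(x), grid - 2)
            Jn = J[i] + (x - i) * (J[i + 1] - J[i])   # interpolate J_{k+1}
            Jnew.append(max(0.0, -C + p * beta * V + (1 - p * beta) * Jn))
        J = Jnew
    return ps, J

ps, J = search_dp()
# The smallest p with J_0(p) > 0 should sit near C/(beta*V) = 0.2.
print(next(p for p, j in zip(ps, J) if j > 0))
```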


    FINITE-STATE SYSTEMS

Suppose the system is a finite-state Markov chain, with states $1, \ldots, n$.
Then the conditional probability distribution $P_{x_k \mid I_k}$ is the vector

$$\big( P(x_k = 1 \mid I_k), \ldots, P(x_k = n \mid I_k) \big)$$

The DP algorithm can be executed over the $n$-dimensional simplex (state space is not expanding with increasing $k$)
When the control and observation spaces are also finite sets, it turns out that the cost-to-go functions $\overline{J}_k$ in the DP algorithm are piecewise linear and concave (Exercise 5.7).
This is conceptually important and also (moderately) useful in practice.


    INSTRUCTION EXAMPLE

Teaching a student some item. Possible states are $L$: Item learned, or $\overline{L}$: Item not learned.
Possible decisions: $T$: Terminate the instruction, or $\overline{T}$: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
The test has two possible outcomes: $R$: Student gives a correct answer, or $\overline{R}$: Student gives an incorrect answer.
[Figure: transition diagram of the probabilistic structure, with transition probability $t$ (an unlearned student becomes learned) and answer probability $r$ (a still-unlearned student answers correctly); a learned student stays learned and answers correctly with probability 1.]
Cost of instruction is $I$ per period
Cost of terminating instruction: 0 if student has learned the item, and $C > 0$ if not.


    INSTRUCTION EXAMPLE III

Write the DP algorithm as

$$\overline{J}_k(p_k) = \min\big[ (1 - p_k) C,\; I + A_k(p_k) \big],$$

where

$$A_k(p_k) = P(z_{k+1} = R \mid I_k)\, \overline{J}_{k+1}\big(\Phi(p_k, R)\big) + P(z_{k+1} = \overline{R} \mid I_k)\, \overline{J}_{k+1}\big(\Phi(p_k, \overline{R})\big),$$

with $\Phi(p_k, \cdot)$ denoting the updated probability given the test outcome.
Can show by induction that the $A_k(p)$ are piecewise linear, concave, monotonically decreasing, with

$$A_{k-1}(p) \leq A_k(p) \leq A_{k+1}(p), \quad \text{for all } p \in [0, 1].$$

[Figure: the termination cost $(1-p)C$ plotted against the curves $I + A_{N-1}(p)$, $I + A_{N-2}(p)$, $I + A_{N-3}(p)$ over $p \in [0, 1]$; the intersections define thresholds $\alpha_{N-1}$, $\alpha_{N-2}$, $\alpha_{N-3}$, and the point $1 - I/C$ is marked on the $p$ axis.]


    6.231 DYNAMIC PROGRAMMING

    LECTURE 13

    LECTURE OUTLINE

    Suboptimal control

Certainty equivalent control
Implementations and approximations
Issues in adaptive control


    PRACTICAL DIFFICULTIES OF DP

The curse of modeling
The curse of dimensionality
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
Intractability of imperfect state information problems
There may be real-time solution constraints
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled $\Rightarrow$ need for on-line replanning


    CERTAINTY EQUIVALENT CONTROL (CEC)

Replace the stochastic problem with a deterministic problem.
At each time $k$, the uncertain quantities are fixed at some typical values.
Implementation for an imperfect info problem. At each time $k$:
(1) Compute a state estimate $\overline{x}_k(I_k)$ given the current information vector $I_k$.
(2) Fix the $w_i$, $i \geq k$, at some $\overline{w}_i(x_i, u_i)$. Solve the deterministic problem:

$$\text{minimize } g_N(x_N) + \sum_{i=k}^{N-1} g_i\big(x_i, u_i, \overline{w}_i(x_i, u_i)\big)$$

subject to $x_k = \overline{x}_k(I_k)$ and for $i \geq k$,

$$u_i \in U_i, \qquad x_{i+1} = f_i\big(x_i, u_i, \overline{w}_i(x_i, u_i)\big).$$

(3) Use as control the first element in the optimal control sequence found.
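A minimal CEC loop for a toy scalar problem, sketched below with the disturbance fixed at its typical value $\overline{w} = 0$ and step (2) solved by brute-force enumeration over a small control set; everything here (dynamics, cost, control set) is an illustrative stand-in, not from the slides.

```python
import itertools, random

def cec_control(x, horizon, controls=(-2, -1, 0, 1, 2)):
    """Step (2): solve the deterministic problem with w fixed at 0;
    step (3): return the first control of the best sequence."""
    best_cost, best_u0 = float("inf"), 0
    for seq in itertools.product(controls, repeat=horizon):
        state, cost = x, 0.0
        for u in seq:
            cost += state**2 + u**2
            state = state + u              # dynamics with w = 0
        cost += state**2                   # terminal cost
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

random.seed(0)
x, N = 4.0, 8
for k in range(N):
    u = cec_control(x, horizon=min(4, N - k))
    x = x + u + random.gauss(0, 0.5)       # the true stochastic system
print(round(x, 3))
```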


    ALTERNATIVE IMPLEMENTATION

Let $\big\{ \mu_0^d(x_0), \ldots, \mu_{N-1}^d(x_{N-1}) \big\}$ be an optimal controller obtained from the DP algorithm for the deterministic problem

$$\text{minimize } g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), \overline{w}_k(x_k, \mu_k(x_k))\big)$$

$$\text{subject to } x_{k+1} = f_k\big(x_k, \mu_k(x_k), \overline{w}_k(x_k, \mu_k(x_k))\big), \qquad \mu_k(x_k) \in U_k$$

The CEC applies at time $k$ the control input

$$\overline{\mu}_k(I_k) = \mu_k^d\big(\overline{x}_k(I_k)\big)$$

[Figure: block diagram. The system $x_{k+1} = f_k(x_k, u_k, w_k)$ with measurement $z_k = h_k(x_k, u_{k-1}, v_k)$ feeds an estimator producing $\overline{x}_k(I_k)$; the actuator applies $u_k = \mu_k^d\big(\overline{x}_k(I_k)\big)$, with a delay feeding $z_k$ and $u_{k-1}$ back.]


    PARTIALLY STOCHASTIC CEC

Instead of fixing all future disturbances to their typical values, fix only some, and treat the rest as stochastic.
Important special case: Treat an imperfect state information problem as one of perfect state information, using an estimate $\overline{x}_k(I_k)$ of $x_k$ as if it were exact.
Multiaccess Communication Example: Consider controlling the slotted Aloha system (discussed in Ch. 5) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.
Natural partially stochastic CEC:

$$\tilde{\mu}_k(I_k) = \min\left[ 1, \frac{1}{\overline{x}_k(I_k)} \right],$$

where $\overline{x}_k(I_k)$ is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is $I_k$).


    THE PROBLEM OF IDENTIFIABILITY

Suppose we consider two phases:
A parameter identification phase (compute an estimate $\hat{\theta}$ of the unknown parameter $\theta$)
A control phase (apply the control that would be optimal if $\hat{\theta}$ were true).
A fundamental difficulty: the control process may make some of the unknown parameters invisible to the identification process.

Example: Consider the scalar system

$$x_{k+1} = a x_k + b u_k + w_k, \quad k = 0, 1, \ldots, N-1,$$

with the cost $E\big\{ \sum_{k=1}^{N} (x_k)^2 \big\}$. If $a$ and $b$ are known, the optimal control law is

$$\mu_k^*(x_k) = -\frac{a}{b}\, x_k.$$

If $a$ and $b$ are not known and we try to estimate them while applying some nominal control law $\mu_k(x_k) = \gamma x_k$, the closed-loop system is

$$x_{k+1} = (a + b\gamma) x_k + w_k,$$

so identification can at best find $(a + b\gamma)$ but not the values of both $a$ and $b$.
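A minimal simulation of this example (not from the slides): a least-squares fit of the closed-loop gain from $(x_k, x_{k+1})$ data recovers only $a + b\gamma$, so two different $(a, b)$ pairs with the same $a + b\gamma$ are indistinguishable. Numbers are illustrative.

```python
import random

def closed_loop_gain_ls(a, b, gamma, N=20000, seed=1):
    """Least-squares estimate of c in x_{k+1} = c x_k + w_k,
    with data generated under the nominal law u_k = gamma * x_k."""
    rng = random.Random(seed)
    x, num, den = 1.0, 0.0, 0.0
    for _ in range(N):
        x_next = a * x + b * (gamma * x) + rng.gauss(0, 1)
        num += x * x_next
        den += x * x
        x = x_next
    return num / den

gamma = 0.3
print(closed_loop_gain_ls(a=0.5, b=1.0, gamma=gamma))  # about a + b*gamma = 0.8
print(closed_loop_gain_ls(a=0.8, b=0.0, gamma=gamma))  # about 0.8 as well
```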


    CEC AND IDENTIFIABILITY I

Suppose we have $P\{x_{k+1} \mid x_k, u_k, \theta\}$ and we use a control law that is optimal for known $\theta$:

$$\mu_k(I_k) = \mu_k^*(x_k, \hat{\theta}_k), \quad \text{with } \hat{\theta}_k: \text{ estimate of } \theta$$

There are three systems of interest:
(a) The system (perhaps falsely) believed by the controller to be true, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \hat{\theta}_k \big\}.$$

(b) The true closed-loop system, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \theta \big\}.$$

(c) The optimal closed-loop system that corresponds to the true value of the parameter, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \theta), \theta \big\}.$$


    CEC AND IDENTIFIABILITY II

[Figure: three boxes. System believed to be true: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \hat{\theta}_k\}$; true closed-loop system: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \theta\}$; optimal closed-loop system: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \theta), \theta\}$.]

Difficulty: There is a built-in mechanism for the parameter estimates to converge to a wrong value.
Assume that for some $\hat{\theta} \neq \theta$ and all $x_{k+1}$, $x_k$,

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}), \hat{\theta} \big\} = P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}), \theta \big\},$$

i.e., there is a false value of the parameter for which the system under closed-loop control looks exactly as if the false value were true.
Then, if the controller estimates the parameter to be $\hat{\theta}$ at some time, subsequent data will tend to reinforce this erroneous estimate.


    REMEDY TO IDENTIFIABILITY PROBLEM

Introduce noise in the applied control, i.e., occasionally deviate from the CEC actions.
This provides a means to escape from wrong estimates.
However, introducing noise in the control may be difficult to implement in practice.
Under some special circumstances, namely the self-tuning control context discussed in the book, the CEC is optimal in the limit, even if the parameter estimates converge to the wrong values.
All of this touches upon some of the most sophisticated aspects of adaptive control.


    6.231 DYNAMIC PROGRAMMING

    LECTURE 14

    LECTURE OUTLINE

    Limited lookahead policies

Performance bounds
Computational aspects
Problem approximation approach

    Vehicle routing example

Heuristic cost-to-go approximation
Computer chess


    LIMITED LOOKAHEAD POLICIES

One-step lookahead (1SL) policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},$$

where
$\tilde{J}_N = g_N$
$\tilde{J}_{k+1}$: approximation to the true cost-to-go $J_{k+1}$
Two-step lookahead policy: At each $k$ and $x_k$, use the control $\tilde{\mu}_k(x_k)$ attaining the minimum above, where the function $\tilde{J}_{k+1}$ is obtained using a 1SL approximation (solve a 2-step DP problem).
If $\tilde{J}_{k+1}$ is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.
Sometimes one also replaces $U_k(x_k)$ above with a subset of most promising controls $\overline{U}_k(x_k)$.
As the length of lookahead increases, the required computation quickly explodes.
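A minimal generic 1SL sketch for a problem with finitely many controls and disturbance values; the toy dynamics, cost, and approximation $\tilde{J}$ below are illustrative, not from the slides.

```python
def one_step_lookahead(x, k, controls, g, f, w_dist, J_tilde):
    """Return argmin_u E{ g(x,u,w) + J_tilde(k+1, f(x,u,w)) }."""
    def q(u):
        return sum(p * (g(x, u, w) + J_tilde(k + 1, f(x, u, w)))
                   for w, p in w_dist)
    return min(controls, key=q)

# Toy example: x_{k+1} = x + u + w, stage cost |x| + u^2, J_tilde = |x|.
u = one_step_lookahead(
    x=3, k=0, controls=[-2, -1, 0, 1, 2],
    g=lambda x, u, w: abs(x) + u**2,
    f=lambda x, u, w: x + u + w,
    w_dist=[(-1, 0.5), (1, 0.5)],
    J_tilde=lambda k, x: abs(x),
)
print(u)  # a control that moves the state toward 0 at modest control cost
```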


    PERFORMANCE BOUNDS

Let $\overline{J}_k(x_k)$ be the cost-to-go from $(x_k, k)$ of the 1SL policy, based on functions $\tilde{J}_k$.
Assume that for all $(x_k, k)$, we have

$$\hat{J}_k(x_k) \leq \tilde{J}_k(x_k), \qquad (*)$$

where $\hat{J}_N = g_N$ and for all $k$,

$$\hat{J}_k(x_k) = \min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}$$

[so $\hat{J}_k(x_k)$ is computed along with $\overline{\mu}_k(x_k)$]. Then

$$\overline{J}_k(x_k) \leq \tilde{J}_k(x_k), \quad \text{for all } (x_k, k).$$

Important application: When $\tilde{J}_k$ is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).
The bound can be extended to the case where there is a $\delta_k$ in the RHS of (*). Then

$$\overline{J}_k(x_k) \leq \tilde{J}_k(x_k) + \delta_k + \cdots + \delta_{N-1}$$


    COMPUTATIONAL ASPECTS

Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when $U_k(x_k)$ is not a discrete set]. Connection with the methodology of stochastic programming.
The choice of the approximating functions $\tilde{J}_k$ is critical; they can be calculated with a variety of methods.

Some approaches:
(a) Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem
(b) Heuristic Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)
(c) Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation


    PROBLEM APPROXIMATION

Many (problem-dependent) possibilities:
Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation
Simplify difficult constraints or dynamics
Example of enforced decomposition: Route $m$ vehicles that move over a graph. Each node has a value. The first vehicle that passes through the node collects its value. Maximize the total collected value, subject to initial and final time constraints (plus time windows and other constraints).
Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single-vehicle problems.
1SL scheme: At time $k$ and state $x_k$ (position of vehicles and set of collected-value nodes), consider all possible $k$th moves by the vehicles, and at the resulting states approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time.


    HEURISTIC COST-TO-GO APPROXIMATION

Use a cost-to-go approximation from a parametric class $\tilde{J}(x, r)$, where $x$ is the current state and $r = (r_1, \ldots, r_m)$ is a vector of tunable scalars (weights).
By adjusting the weights, one can change the shape of the approximation $\tilde{J}$ so that it is reasonably close to the true optimal cost-to-go function.
Two key issues:
The choice of the parametric class $\tilde{J}(x, r)$ (the approximation architecture).
The method for tuning the weights (training the architecture).
Successful application strongly depends on how these issues are handled, and on insight about the problem.
Sometimes a simulator is used, particularly when there is no mathematical model of the system.


    APPROXIMATION ARCHITECTURES

Divided into linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde{J}(x, r)$ on $r$].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
Architectures based on feature extraction:
[Figure: the state $x$ passes through a feature extraction mapping to a feature vector $y$, which feeds a cost approximator with parameter vector $r$, producing the cost approximation $\tilde{J}(y, r)$.]

Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go function being approximated, and the approximation may be quite accurate without a complicated architecture.
Sometimes the state space is partitioned, and local features are introduced for each subset of the partition (they are 0 outside the subset).
With a well-chosen feature vector $y(x)$, we can use a linear architecture

$$\tilde{J}(x, r) = \hat{J}\big(y(x), r\big) = \sum_i r_i y_i(x)$$
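A minimal sketch of such a linear architecture, with the weights $r$ tuned by least squares on sampled (state, target-cost) pairs; the features and data are illustrative.

```python
import numpy as np

def features(x):
    return np.array([1.0, x, x * x])       # y(x): constant, linear, quadratic

xs = np.linspace(-2, 2, 41)                # sampled states
rng = np.random.default_rng(0)
targets = xs**2 + 0.1 * rng.standard_normal(xs.size)  # noisy cost targets

Y = np.stack([features(x) for x in xs])    # rows: feature vectors y(x)
r, *_ = np.linalg.lstsq(Y, targets, rcond=None)       # tune the weights

J_tilde = lambda x: features(x) @ r        # J(x, r) = sum_i r_i y_i(x)
print(J_tilde(1.5))                        # roughly 2.25
```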


    COMPUTER CHESS I

Programs use a feature-based position evaluator that assigns a score to each move/position
[Figure: a position evaluator consisting of feature extraction (features such as material balance, mobility, safety, etc.) followed by a weighting of the features that produces a score.]

Most often the weighting of features is linear, but multistep lookahead is involved.
Most often the training is done by trial and error.
Additional techniques:
Depth-first search
Variable depth search when dynamic positions are involved
Alpha-beta pruning


    COMPUTER CHESS II

Multistep lookahead tree:
[Figure: a lookahead tree from position P (White to move) with alternating White/Black levels and candidate moves M1, M2; leaf evaluator scores (e.g., +8, +20, +18, +16, +24, ...) are backed up to internal nodes (values such as (+16), (+20), (+11)), and several branches are marked "Cutoff" by alpha-beta pruning.]

Alpha-beta pruning: As the move scores are evaluated by depth-first search, branches whose consideration (based on the calculations so far) cannot possibly change the optimal move are neglected.
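A minimal alpha-beta sketch over an explicit game tree (nested lists: internal nodes are lists of children, leaves are evaluator scores); purely illustrative, not the slides' chess program.

```python
def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):             # leaf: evaluator score
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if beta <= alpha:                  # cutoff: cannot change the result
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if beta <= alpha:                      # cutoff
            break
    return value

tree = [[[8, 20], [18, 16]], [[24, 20], [10, 12]]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # 18
```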


    6.231 DYNAMIC PROGRAMMING

    LECTURE 15

    LECTURE OUTLINE

    Rollout algorithms

Cost improvement property
Discrete deterministic problems
Sequential consistency and greedy algorithms

    Sequential improvement


    ROLLOUT ALGORITHMS

One-step lookahead policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},$$

where $\tilde{J}_N = g_N$ and $\tilde{J}_{k+1}$: approximation to the true cost-to-go $J_{k+1}$
Rollout algorithm: When $\tilde{J}_k$ is the cost-to-go of some heuristic policy (called the base policy)
Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.
Main difficulty: Calculating $\tilde{J}_k(x_k)$ may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case.


    EXAMPLE: THE QUIZ PROBLEM

A person is given $N$ questions; answering correctly question $i$ has probability $p_i$, with reward $v_i$.
Quiz terminates at the first incorrect answer.
Problem: Choose the ordering of questions so as to maximize the total expected reward.
Assuming no other constraints, it is optimal to use the index policy: Questions should be answered in decreasing order of the index of preference $p_i v_i / (1 - p_i)$.
With minor changes in the problem, the index policy need not be optimal. Examples:
A limit ($< N$) on the maximum number of questions that can be answered.
Time windows, sequence-dependent rewards, precedence constraints.
Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.
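A minimal sketch of the index policy and of the expected reward it achieves from a set of remaining questions, which is exactly the quantity a rollout scheme needs at each state; the $(p_i, v_i)$ data are illustrative.

```python
def index_order(questions):
    """Sort by the index p_i v_i / (1 - p_i), decreasing."""
    return sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]), reverse=True)

def expected_reward(ordered):
    """The quiz stops at the first incorrect answer."""
    total, alive = 0.0, 1.0
    for p, v in ordered:
        total += alive * p * v   # reward v collected if still alive and correct
        alive *= p
    return total

questions = [(0.9, 1.0), (0.5, 4.0), (0.2, 20.0)]   # (p_i, v_i) pairs
print(expected_reward(index_order(questions)))       # 4.86
```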


    COST IMPROVEMENT PROPERTY

Let
$\overline{J}_k(x_k)$: cost-to-go of the rollout policy
$H_k(x_k)$: cost-to-go of the base policy
We claim that $\overline{J}_k(x_k) \leq H_k(x_k)$ for all $x_k$ and $k$.
Proof by induction: We have $\overline{J}_N(x_N) = H_N(x_N)$ for all $x_N$. Assume that

$$\overline{J}_{k+1}(x_{k+1}) \leq H_{k+1}(x_{k+1}), \quad \forall\, x_{k+1}.$$

Then, for all $x_k$,

$$\begin{aligned} \overline{J}_k(x_k) &= E\Big\{ g_k\big(x_k, \overline{\mu}_k(x_k), w_k\big) + \overline{J}_{k+1}\big(f_k(x_k, \overline{\mu}_k(x_k), w_k)\big) \Big\} \\ &\leq E\Big\{ g_k\big(x_k, \overline{\mu}_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \overline{\mu}_k(x_k), w_k)\big) \Big\} \\ &\leq E\Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\ &= H_k(x_k) \end{aligned}$$

[the first inequality uses the induction hypothesis; the second holds because the rollout control $\overline{\mu}_k(x_k)$ minimizes the expression over all controls, including the base policy control $\mu_k(x_k)$]


    EXAMPLE: THE BREAKTHROUGH PROBLEM


Given a binary tree with $N$ stages.
Each arc is either free or is blocked (crossed out in the figure).
Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).
Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.
For large $N$ and a given probability of a free branch: the rollout algorithm requires $O(N)$ times more computation, but has $O(N)$ times larger probability of finding a free path than the greedy algorithm.
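A minimal simulation comparing the greedy base heuristic with its rollout on random instances (each arc free with probability $q$); the $O(N)$ analysis above is the slides', while the sketch below just measures success rates empirically with illustrative parameters.

```python
import random

def make_tree(N, q, rng):
    """Level d has 2**(d+1) arcs into level d+1; each free with prob q."""
    return [[rng.random() < q for _ in range(2 ** (d + 1))] for d in range(N)]

def greedy_from(tree, d, node):
    """Base heuristic from (d, node): right branch if free, else left."""
    for dd in range(d, len(tree)):
        right, left = 2 * node + 1, 2 * node
        if tree[dd][right]:   node = right
        elif tree[dd][left]:  node = left
        else:                 return False
    return True

def rollout(tree):
    node = 0
    for d in range(len(tree)):
        kids = [c for c in (2 * node + 1, 2 * node) if tree[d][c]]
        if not kids:
            return False
        # prefer a child whose greedy projection reaches a leaf
        node = next((c for c in kids if greedy_from(tree, d + 1, c)), kids[0])
    return True

rng = random.Random(0)
trials = [make_tree(10, 0.6, rng) for _ in range(2000)]
print(sum(greedy_from(t, 0, 0) for t in trials) / 2000,
      sum(rollout(t) for t in trials) / 2000)
```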


    DISCRETE DETERMINISTIC PROBLEMS

Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented as a sequential decision process by using a tree.
The leaves of the tree correspond to the feasible solutions.
The problem can be solved by DP, starting from the leaves and going back towards the root.
Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of $N$ cities.

[Figure: decision tree for the traveling salesman problem with four cities A, B, C, D. From origin node $s = A$, the partial tours AB, AC, AD branch into ABC, ABD, ACB, ACD, ADB, ADC, which lead to the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB.]


    A CLASS OF GENERAL DISCRETE PROBLEMS

Generic problem:
Given a graph with directed arcs
A special node $s$ called the origin
A set of terminal nodes, called destinations, and a cost $g(i)$ for each destination $i$.
Find a min cost path starting at the origin, ending at one of the destination nodes.
Base heuristic: For any nondestination node $i$, constructs a path $(i, i_1, \ldots, i_m, \overline{i})$ starting at $i$ and ending at one of the destination nodes $\overline{i}$. We call $\overline{i}$ the projection of $i$, and we denote $H(i) = g(\overline{i})$.

    s i1 im

    j1

    j2

    j3

    j4

    p(j1)

    p(j2)

    p(j3)

    p(j4)

    im-1

    Neighbors of imProjections of

    Neighbors of im


    EXAMPLE: ONE-DIMENSIONAL WALK

A person takes either a unit step to the left or a unit step to the right. Minimize the cost $g(i)$ of the point $i$ where he will end up after $N$ steps.
[Figure: the cost $g(i)$ plotted over the final positions $i \in \{-N, \ldots, N\}$, together with the triangle of trajectories from $(0, 0)$ to $(N, -N), \ldots, (N, N)$; the rightmost local minimum and the global minimum of $g$ are marked.]

Base heuristic: Always go to the right. Rollout finds the rightmost local minimum.
Base heuristic: Compare "always go to the right" and "always go to the left." Choose the best of the two. Rollout finds a global minimum.


    SEQUENTIAL CONSISTENCY

The base heuristic is sequentially consistent if, for every node $i$, whenever it generates the path $(i, i_1, \ldots, i_m, \overline{i})$ starting at $i$, it also generates the path $(i_1, \ldots, i_m, \overline{i})$ starting at the node $i_1$ (i.e., all nodes of its path have the same projection).
Prime example of a sequentially consistent heuristic is a greedy algorithm. It uses an estimate $F(i)$ of the optimal cost starting from $i$.
At the typical step, given a path $(i, i_1, \ldots, i_m)$, where $i_m$ is not a destination, the algorithm adds to the path a node $i_{m+1}$ such that

$$i_{m+1} = \arg\min_{j \in N(i_m)} F(j)$$

If the base heuristic is sequentially consistent, the cost of the rollout algorithm is no more than the cost of the base heuristic. In particular, if $(s, i_1, \ldots, i_{\overline{m}})$ is the rollout path, we have

$$H(s) \geq H(i_1) \geq \cdots \geq H(i_{\overline{m}-1}) \geq H(i_{\overline{m}}),$$

where $H(i) =$ cost of the heuristic starting from $i$.


    6.231 DYNAMIC PROGRAMMING

    LECTURE 16

    LECTURE OUTLINE

    More on rollout algorithms

Simulation-based methods
Approximations of rollout algorithms
Rolling horizon approximations

    Discretization issues

    Other suboptimal approaches


    ROLLOUT ALGORITHMS

Rollout policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k),$$

where

$$Q_k(x_k, u_k) = E\Big\{ g_k(x_k, u_k, w_k) + H_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}$$

and $H_{k+1}(x_{k+1})$ is the cost-to-go of the heuristic.
$Q_k(x_k, u_k)$ is called the Q-factor of $(x_k, u_k)$, and for a stochastic problem, its computation may involve Monte Carlo simulation.
Potential difficulty: To minimize the Q-factor over $u_k$, we must form Q-factor differences $Q_k(x_k, u) - Q_k(x_k, \overline{u})$. This differencing often amplifies the simulation error in the calculation of the Q-factors.
Potential remedy: Compare any two controls $u$ and $\overline{u}$ by simulating the difference $Q_k(x_k, u) - Q_k(x_k, \overline{u})$ directly.


    ROLLING HORIZON APPROACH

This is an $l$-step lookahead policy where the cost-to-go approximation is just 0.
Alternatively, the cost-to-go approximation is the terminal cost function $g_N$.
A short rolling horizon saves computation.
Paradox: It is not true that a longer rolling horizon always improves performance.
Example: At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.

[Figure: from the current state, control 1 leads to the optimal trajectory, which shows high cost over the first $l$ stages; control 2 leads to a trajectory with low cost over the first $l$ stages but high cost thereafter, so the $l$-stage rolling horizon prefers the inferior control 2.]


    ROLLING HORIZON COMBINED WITH ROLLOUT

We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.
Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.
Example: $N$-stage stopping problem where the stopping cost is 0, the continuation cost is either $-\epsilon$ or 1, where $0 < \epsilon < 1/N$, and the first state with continuation cost equal to 1 is state $m$. Then the optimal policy is to stop at state $m$, and the optimal cost is $-m\epsilon$.
[Figure: states $0, 1, 2, \ldots, m, \ldots, N$ in a line, with a transition to the stopped state available at each state; the continuation cost is $-\epsilon$ before state $m$ and 1 at state $m$.]
Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of $l \leq m$ steps.
It will continue up to the first $m - l + 1$ stages, thus accumulating a cost of $-(m - l + 1)\epsilon$. The rollout performance improves as $l$ becomes shorter!


    GENERAL APPROACH FOR DISCRETIZATION I

Given a discrete-time system with state space $S$, consider a finite subset $\overline{S}$; for example $\overline{S}$ could be a finite grid within a continuous state space $S$. Assume stationarity for convenience, i.e., that the system equation and cost per stage are the same for all times.
We define an approximation to the original problem, with state space $\overline{S}$, as follows:
Express each $x \in S$ as a convex combination of states in $\overline{S}$, i.e.,

$$x = \sum_{x_i \in \overline{S}} \phi_i(x)\, x_i, \quad \text{where } \phi_i(x) \geq 0, \quad \sum_i \phi_i(x) = 1$$

Define a reduced dynamic system with state space $\overline{S}$, whereby from each $x_i \in \overline{S}$ we move to $x = f(x_i, u, w)$ according to the system equation of the original problem, and then move to $x_j \in \overline{S}$ with probabilities $\phi_j(x)$.
Define similarly the corresponding cost per stage of the transitions of the reduced system.
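A minimal sketch of this construction for a one-dimensional state space with a uniform grid: the weights $\phi_i(x)$ come from linear interpolation between the two grid points enclosing $x$ (names and numbers are illustrative).

```python
import numpy as np

def weights(x, grid):
    """Express x as a convex combination of the two enclosing grid points."""
    j = np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2)
    lam = (grid[j + 1] - x) / (grid[j + 1] - grid[j])
    phi = np.zeros(len(grid))
    phi[j], phi[j + 1] = lam, 1 - lam
    return phi                        # phi_i(x) >= 0, sum_i phi_i(x) = 1

grid = np.linspace(-1.0, 1.0, 11)

def reduced_transition(i, u, w, f):
    """From grid point x_i, move to x = f(x_i, u, w) in the original system,
    then to grid point x_j with probability phi_j(x)."""
    x = f(grid[i], u, w)
    return weights(np.clip(x, grid[0], grid[-1]), grid)

probs = reduced_transition(i=5, u=0.3, w=0.05, f=lambda x, u, w: 0.9 * x + u + w)
print(probs.nonzero()[0], probs[probs > 0])   # two neighboring grid points
```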


    GENERAL APPROACH FOR DISCRETIZATION II