MIT Dynamic Programming Lecture Slides


TRANSCRIPT

  • Slide 1/261

    LECTURE SLIDES ON DYNAMIC PROGRAMMING

    BASED ON LECTURES GIVEN AT THE

    MASSACHUSETTS INSTITUTE OF TECHNOLOGY

    CAMBRIDGE, MASS

    FALL 2004

    DIMITRI P. BERTSEKAS

These lecture slides are based on the book: Dynamic Programming and Optimal Control, 2nd edition, Vols. I and II, Athena Scientific, 2001, by Dimitri P. Bertsekas; see

    http://www.athenasc.com/dpbook.html

    Last Updated: December 2004

The slides are copyrighted, but may be freely reproduced and distributed for any noncommercial purpose.

  • Slide 2/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 1

    LECTURE OUTLINE

    Problem Formulation

Examples

The Basic Problem

Significance of Feedback

  • Slide 3/261

  • Slide 4/261

    BASIC STRUCTURE OF STOCHASTIC DP

Discrete-time system: x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N-1

k: discrete time

x_k: state; summarizes past information that is relevant for future optimization

u_k: control; decision to be selected at time k from a given set

w_k: random parameter (also called disturbance or noise, depending on the context)

N: horizon, or number of times control is applied

Cost function that is additive over time:

E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
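The backward recursion implied by this additive structure is short to state in code. The following is a minimal sketch (my illustration, not from the slides), assuming finite state, control, and disturbance spaces and time-invariant f and g; w_dist is a list of (w, probability) pairs:

```python
def solve_dp(states, controls, f, g, g_N, w_dist, N):
    """Backward DP: J_N(x) = g_N(x),
    J_k(x) = min_u E_w[ g(x,u,w) + J_{k+1}(f(x,u,w)) ]."""
    J = [dict() for _ in range(N + 1)]   # J[k][x]: cost-to-go
    mu = [dict() for _ in range(N)]      # mu[k][x]: minimizing control
    for x in states:
        J[N][x] = g_N(x)                 # terminal cost
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best = None, float("inf")
            for u in controls(x):
                # expected stage cost plus cost-to-go over the disturbance
                c = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                        for w, p in w_dist)
                if c < best:
                    best_u, best = u, c
            J[k][x], mu[k][x] = best, best_u
    return J, mu
```

The returned mu is a policy in the sense of the slides: a rule u_k = mu_k(x_k) mapping states to controls.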

  • Slide 5/261

    INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram. Stock x_k at period k, stock u_k ordered at period k, demand w_k; stock at period k+1: x_{k+1} = x_k + u_k - w_k; cost of period k: c u_k + r(x_k + u_k - w_k).]

Discrete-time system:

x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k

Cost function that is additive over time:

E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E{ Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }

Optimization over policies: rules/functions u_k = μ_k(x_k) that map states to controls

  • Slide 6/261

    ADDITIONAL ASSUMPTIONS

The set of values that the control u_k can take depends at most on x_k and not on prior x or u

The probability distribution of w_k does not depend on past values w_{k-1}, ..., w_0, but may depend on x_k and u_k

Otherwise past values of w or x would be useful for future optimization

Sequence of events envisioned in period k:

x_k occurs according to x_k = f_{k-1}(x_{k-1}, u_{k-1}, w_{k-1})

u_k is selected with knowledge of x_k, i.e., u_k ∈ U(x_k)

w_k is random and generated according to a distribution P_{w_k}(· | x_k, u_k)

  • Slide 7/261

    DETERMINISTIC FINITE-STATE PROBLEMS

Scheduling example: Find the optimal sequence of operations A, B, C, D

A must precede B, and C must precede D

Given startup costs S_A and S_C, and setup transition cost C_mn from operation m to operation n

[Figure: state transition graph from the initial state. Nodes are the partial operation sequences (A, C, AB, AC, CA, CD, and so on, through the complete schedules); arcs carry the startup costs S_A, S_C and the transition costs C_mn between consecutive operations.]

  • Slide 8/261

    STOCHASTIC FINITE-STATE PROBLEMS

Example: Find the optimal two-game chess match strategy

Timid play draws with prob. p_d > 0 and loses with prob. 1 - p_d. Bold play wins with prob. p_w < 1/2 and loses with prob. 1 - p_w

  • Slide 32/261

    EXAMPLE

[Figure: shortest path reformulation of the scheduling example. Origin node s = A, artificial terminal node t; the partial schedules AB, AC, AD, ABC, ABD, ACB, ACD, ADB, ADC and the complete schedules ABCD, ABDC, ACBD, ACDB, ADBC, ADCB form the intermediate nodes (numbered 1 through 10 in the iteration table below), with arc lengths such as 1, 3, 4, 5, 15, and 20 taken from the setup and transition costs.]

Iter. No. | Node Exiting OPEN | OPEN after Iteration | UPPER

0  | -  | 1           | ∞
1  | 1  | 2, 7, 10    | ∞
2  | 2  | 3, 5, 7, 10 | ∞
3  | 3  | 4, 5, 7, 10 | ∞
4  | 4  | 5, 7, 10    | 43
5  | 5  | 6, 7, 10    | 43
6  | 6  | 7, 10       | 13
7  | 7  | 8, 10       | 13
8  | 8  | 9, 10       | 13
9  | 9  | 10          | 13
10 | 10 | Empty       | 13

Note that some nodes never entered OPEN

  • Slide 33/261

    LABEL CORRECTING METHODS

Origin s, destination t, lengths a_ij that are ≥ 0

d_i (label of i): length of the shortest path from s to i found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s → i path

UPPER: label d_t of the destination

OPEN list: contains active nodes (initially OPEN = {s})

Flowchart: REMOVE a node i from OPEN; for each child j of i:

Is d_i + a_ij < d_j? (Is the path s → i → j better than the current path s → j?)

Is d_i + a_ij < UPPER? (Does the path s → i → j have a chance to be part of a shorter s → t path?)

If YES to both, set d_j = d_i + a_ij and INSERT j into OPEN
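A minimal sketch of the method in code (my illustration, assuming the graph is given as succ[i], a list of (j, a_ij) pairs with a_ij ≥ 0); it applies exactly the two tests of the flowchart and never places the destination t in OPEN:

```python
import math

def label_correcting(succ, s, t):
    """Shortest s -> t distance by the generic label correcting method."""
    d = {s: 0.0}                  # labels; a missing key means infinity
    UPPER = math.inf              # label of the destination
    OPEN = [s]
    while OPEN:
        i = OPEN.pop()            # REMOVE (LIFO gives a depth-first flavor)
        for j, a_ij in succ.get(i, []):
            # is s -> i -> j better than the current s -> j path,
            # and can it be part of a shorter s -> t path?
            if d[i] + a_ij < d.get(j, math.inf) and d[i] + a_ij < UPPER:
                d[j] = d[i] + a_ij        # set d_j = d_i + a_ij
                if j == t:
                    UPPER = d[j]
                else:
                    OPEN.append(j)        # INSERT j into OPEN
    return UPPER
```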

  • Slide 34/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 4

    LECTURE OUTLINE

    Label correcting methods for shortest paths

    Variants of label correcting methods

    Branch-and-bound as a shortest path algorithm

  • Slide 35/261

    LABEL CORRECTING METHODS

Origin s, destination t, lengths a_ij that are ≥ 0

d_i (label of i): length of the shortest path from s to i found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s → i path

UPPER: label d_t of the destination

OPEN list: contains active nodes (initially OPEN = {s})

Flowchart: REMOVE a node i from OPEN; for each child j of i:

Is d_i + a_ij < d_j? (Is the path s → i → j better than the current path s → j?)

Is d_i + a_ij < UPPER? (Does the path s → i → j have a chance to be part of a shorter s → t path?)

If YES to both, set d_j = d_i + a_ij and INSERT j into OPEN

  • Slide 36/261

    VALIDITY OF LABEL CORRECTING METHODS

Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination.

Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j

(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates

(3) Let (s, j_1, j_2, ..., j_k, t) be a shortest path and let d* be the shortest distance. If UPPER > d* at termination, UPPER will also be larger than the length of all the paths (s, j_1, ..., j_m), m = 1, ..., k, throughout the algorithm. Hence, node j_k will never enter the OPEN list with d_{j_k} equal to the shortest distance from s to j_k. Similarly, node j_{k-1} will never enter the OPEN list with d_{j_{k-1}} equal to the shortest distance from s to j_{k-1}. Continue to j_1 to get a contradiction.

  • Slide 37/261

    MAKING THE METHOD EFFICIENT

Reduce the value of UPPER as quickly as possible

Try to discover good s → t paths early in the course of the algorithm

Keep the number of reentries into OPEN low

Try to remove from OPEN nodes with small label first

Heuristic rationale: if d_i is small, then d_j, when set to d_i + a_ij, will be accordingly small, so reentrance of j in the OPEN list is less likely

Reduce the overhead for selecting the node to be removed from OPEN

These objectives are often in conflict. They give rise to a large variety of distinct implementations. Good practical strategies try to strike a compromise between low overhead and small-label node selection.

  • Slide 38/261

    NODE SELECTION METHODS

Depth-first search: Remove from the top of OPEN and insert at the top of OPEN.

Has low memory storage requirements (OPEN is not too long). Reduces UPPER quickly.

[Figure: depth-first search order on a tree with origin node s and destination node t; nodes are numbered in the order they exit OPEN.]

Best-first search (Dijkstra): Remove from OPEN a node with minimum value of label.

Interesting property: Each node will be inserted in OPEN at most once.

    Many implementations/approximations
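For contrast with depth-first removal, a best-first sketch that keeps OPEN in a binary heap might look as follows (my illustration, with the same succ representation as before; stale heap entries are skipped rather than deleted):

```python
import heapq, math

def best_first(succ, s, t):
    """Dijkstra-like node selection: remove a minimum-label node from OPEN."""
    d = {s: 0.0}
    OPEN = [(0.0, s)]
    while OPEN:
        di, i = heapq.heappop(OPEN)
        if di > d.get(i, math.inf):
            continue                      # stale entry: i was re-labeled
        if i == t:
            return di                     # first removal of t is optimal
        for j, a_ij in succ.get(i, []):
            if di + a_ij < d.get(j, math.inf):
                d[j] = di + a_ij
                heapq.heappush(OPEN, (d[j], j))
    return math.inf
```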

  • Slide 39/261

    ADVANCED INITIALIZATION

Instead of starting from d_i = ∞ for all i ≠ s, start with

d_i = length of some path from s to i (or d_i = ∞)

OPEN = { i ≠ t | d_i < ∞ }

Motivation: Get a small starting value of UPPER.

No node with shortest distance ≥ the initial value of UPPER will enter OPEN

Good practical idea:

Run a heuristic (or use common sense) to get a good starting path P from s to t

Use as UPPER the length of P, and as d_i the path distances of all nodes i along P

Very useful also in reoptimization, where we solve the same problem with slightly different data

  • Slide 40/261

    VARIANTS OF LABEL CORRECTING METHODS

If a lower bound h_j of the true shortest distance from j to t is known, use the test

d_i + a_ij + h_j < UPPER

for admission into OPEN (in place of d_i + a_ij < UPPER)

  • Slide 41/261

    BRANCH-AND-BOUND METHOD

Problem: Minimize f(x) over a finite set of feasible solutions X.

Idea of branch-and-bound: Partition the feasible set into smaller subsets, and then calculate certain bounds on the attainable cost within some of the subsets to eliminate from further consideration other subsets.

Bounding Principle

Given two subsets Y_1 ⊂ X and Y_2 ⊂ X, suppose that we have bounds

f_1 ≤ min_{x∈Y_1} f(x) (a lower bound for Y_1), f_2 ≥ min_{x∈Y_2} f(x) (an upper bound for Y_2).

Then, if f_2 ≤ f_1, the solutions in Y_1 may be disregarded since their cost cannot be smaller than the cost of the best solution in Y_2.

The B&B algorithm can be viewed as a label correcting algorithm, where lower bounds define the arc costs, and upper bounds are used to strengthen the test for admission to OPEN.

  • Slide 42/261

    SHORTEST PATH IMPLEMENTATION

Acyclic graph/partition of X into subsets (typically a tree). The leaves consist of single solutions.

Upper and lower bounds, f̄_Y and f_Y, for the minimum cost over each subset Y can be calculated (f denoting the lower bound below).

The lower bound of a leaf {x} is f(x)

Each arc (Y, Z) has length f_Z - f_Y

Shortest distance from X to Y = f_Y - f_X

Distance from origin X to a leaf {x} is f(x) - f_X

Shortest path from X to the set of leaves gives the optimal cost and optimal solution

UPPER is the smallest f(x) out of leaf nodes {x} examined so far

[Figure: tree partition of {1,2,3,4,5} into {1,2,3} and {4,5}, with {1,2,3} split into {1,2} and {3}, down to the leaves {1}, {2}, {3}, {4}, {5}.]

  • Slide 43/261

    BRANCH-AND-BOUND ALGORITHM

Step 1: Remove a node Y from OPEN. For each child Y_j of Y, do the following: If f_{Y_j} < UPPER (f the lower bound), place Y_j in OPEN. If in addition f̄_{Y_j} < UPPER, set UPPER = f̄_{Y_j}, and if Y_j consists of a single solution, mark that solution as the best solution found so far.

Step 2: (Termination test) If OPEN is nonempty, go to Step 1. Otherwise, terminate: the best solution found so far is optimal.
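A hedged sketch of these steps (my illustration; root, children, and the bound functions are assumed supplied by the user, with f_lower(Y) ≤ min over Y ≤ f_upper(Y) and both bounds equal to f(x) at a singleton leaf):

```python
import math

def branch_and_bound(root, children, f_lower, f_upper):
    """Sketch of the slides' algorithm; subsets are tuples of solutions."""
    UPPER, best = math.inf, None
    OPEN = [root]
    while OPEN:                           # Step 2: stop when OPEN is empty
        Y = OPEN.pop()                    # Step 1: remove a node Y
        for Yj in children(Y):
            if f_lower(Yj) < UPPER:       # Yj may contain a better solution
                if len(Yj) == 1:
                    UPPER, best = f_lower(Yj), Yj[0]   # leaf: exact cost
                else:
                    OPEN.append(Yj)
            if f_upper(Yj) < UPPER:       # tighten the incumbent bound
                UPPER = f_upper(Yj)
    return UPPER, best
```

On termination UPPER equals the optimal cost; best records the last improving leaf examined.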

  • Slide 44/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 5

    LECTURE OUTLINE

    Examples of stochastic DP problems

Linear-quadratic problems

Inventory control

  • Slide 45/261

    LINEAR-QUADRATIC PROBLEMS

System: x_{k+1} = A_k x_k + B_k u_k + w_k

Quadratic cost:

E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + Σ_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }

where Q_k ≥ 0 and R_k > 0 (in the positive (semi)definite sense).

w_k are independent and zero mean

DP algorithm:

J_N(x_N) = x_N' Q_N x_N,

J_k(x_k) = min_{u_k} E{ x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }

Key facts:

J_k(x_k) is quadratic

Optimal policy {μ*_0, ..., μ*_{N-1}} is linear: μ*_k(x_k) = L_k x_k

Similar treatment of a number of variants

  • Slide 46/261

    DERIVATION

By induction verify that

μ*_k(x_k) = L_k x_k, J_k(x_k) = x_k' K_k x_k + constant,

where the L_k are matrices given by

L_k = -(B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} A_k,

and where the K_k are symmetric positive semidefinite matrices given by

K_N = Q_N,

K_k = A_k' ( K_{k+1} - K_{k+1} B_k (B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} ) A_k + Q_k.

This is called the discrete-time Riccati equation. Just like DP, it starts at the terminal time N and proceeds backwards.

Certainty equivalence holds (the optimal policy is the same as when w_k is replaced by its expected value E{w_k} = 0).
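A short numerical sketch of the recursion (my illustration, assuming time-invariant A, B, Q, R given as numpy arrays):

```python
import numpy as np

def riccati_backward(A, B, Q, R, QN, N):
    """Discrete-time Riccati recursion, run backward from K_N = Q_N.
    Returns K_0 and the gains L_0, ..., L_{N-1}."""
    K = QN
    gains = []
    for _ in range(N):
        M = B.T @ K @ B + R
        L = -np.linalg.solve(M, B.T @ K @ A)                 # L_k
        K = A.T @ (K - K @ B @ np.linalg.solve(M, B.T @ K)) @ A + Q
        gains.append(L)
    gains.reverse()                                          # gains[k] = L_k
    return K, gains
```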

  • Slide 47/261

    ASYMPTOTIC BEHAVIOR OF RICCATI EQUATION

Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C'C

The Riccati equation converges: lim_{k→∞} K_k = K, where K is positive definite, and is the unique (within the class of positive semidefinite matrices) solution of the algebraic Riccati equation

K = A' ( K - KB(B'KB + R)^{-1} B'K ) A + Q

The corresponding steady-state controller μ*(x) = Lx, where

L = -(B'KB + R)^{-1} B'KA,

is stable in the sense that the matrix (A + BL) of the closed-loop system

x_{k+1} = (A + BL) x_k + w_k

satisfies lim_{k→∞} (A + BL)^k = 0.

  • Slide 48/261

    GRAPHICAL PROOF FOR SCALAR SYSTEMS

[Figure: graph of F(P) against the 45° line, with intercept Q at P = 0, horizontal asymptote A²R/B² + Q, pole at P = -R/B², and iterates P_k, P_{k+1} converging to the positive fixed point P*.]

Riccati equation (with P_k = K_{N-k}):

P_{k+1} = A² ( P_k - B² P_k² / (B² P_k + R) ) + Q,

or P_{k+1} = F(P_k), where

F(P) = A² R P / (B² P + R) + Q.

Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.

  • Slide 49/261

    RANDOM SYSTEM MATRICES

Suppose that {A_0, B_0}, ..., {A_{N-1}, B_{N-1}} are not known but rather are independent random matrices that are also independent of the w_k

The DP algorithm is

J_N(x_N) = x_N' Q_N x_N,

J_k(x_k) = min_{u_k} E_{w_k, A_k, B_k} { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }

Optimal policy: μ*_k(x_k) = L_k x_k, where

L_k = -( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k},

and where the matrices K_k are given by

K_N = Q_N,

K_k = E{A_k' K_{k+1} A_k} - E{A_k' K_{k+1} B_k} ( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k} + Q_k

  • Slide 50/261

    PROPERTIES

Certainty equivalence may not hold

The Riccati equation may not converge to a steady state

[Figure: graph of F(P) against the 45° line, with intercept Q at P = 0 and vertical asymptote at P = -R/E{B²}.]

We have P_{k+1} = F(P_k), where

F(P) = E{A²} R P / ( E{B²} P + R ) + Q + T P² / ( E{B²} P + R ),

T = E{A²} E{B²} - (E{A})² (E{B})²

  • Slide 51/261

    INVENTORY CONTROL

x_k: stock, u_k: inventory purchased, w_k: demand

x_{k+1} = x_k + u_k - w_k, k = 0, 1, ..., N-1

Minimize

E{ Σ_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }

where, for some p > 0 and h > 0,

r(x) = p max(0, -x) + h max(0, x)

DP algorithm:

J_N(x_N) = 0,

J_k(x_k) = min_{u_k≥0} [ c u_k + H(x_k + u_k) + E{ J_{k+1}(x_k + u_k - w_k) } ],

where H(x + u) = E{ r(x + u - w) }.

  • Slide 52/261

    OPTIMAL POLICY

The DP algorithm can be written as

J_N(x_N) = 0,

J_k(x_k) = min_{u_k≥0} G_k(x_k + u_k) - c x_k,

where

G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) }.

If G_k is convex and lim_{|x|→∞} G_k(x) → ∞, we have

μ*_k(x_k) = S_k - x_k if x_k < S_k, and 0 if x_k ≥ S_k,

where S_k minimizes G_k(y).

This is shown, assuming that c < p, by showing that J_k is convex for all k, and

lim_{|x|→∞} J_k(x) → ∞
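For intuition, a discretized sketch that computes the base-stock levels S_k on a finite grid (my illustration; the grid is assumed rich enough to contain y - w for the demands used, and off-grid values are given zero cost-to-go):

```python
def base_stock_levels(N, c, p, h, demand, probs, y_grid):
    """Backward recursion computing the minimizers S_k of
    G_k(y) = c*y + H(y) + E J_{k+1}(y - w), with
    r(x) = p*max(0,-x) + h*max(0,x) and H(y) = E r(y - w)."""
    r = lambda x: p * max(0.0, -x) + h * max(0.0, x)
    J = {y: 0.0 for y in y_grid}                     # J_N = 0
    S = [None] * N
    for k in range(N - 1, -1, -1):
        G = {y: c * y + sum(q * (r(y - w) + J.get(y - w, 0.0))
                            for w, q in zip(demand, probs))
             for y in y_grid}
        S[k] = min(G, key=G.get)                     # base-stock level S_k
        # J_k(x) = min_{y >= x} G_k(y) - c*x
        J = {x: min(G[y] for y in y_grid if y >= x) - c * x for x in y_grid}
    return S
```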

  • Slide 53/261

    JUSTIFICATION

Graphical inductive proof that J_k is convex.

[Figure: plots of -cy, H(y), and cy + H(y) against y, with minimizer S_{N-1} and value cS_{N-1}; the resulting cost-to-go J_{N-1}(x_{N-1}) is convex in x_{N-1}.]

  • Slide 54/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 6

    LECTURE OUTLINE

    Stopping problems

Scheduling problems

Other applications

  • Slide 55/261

    PURE STOPPING PROBLEMS

Two possible controls:

Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)

Continue [using x_{k+1} = f_k(x_k, w_k) and incurring the cost per stage]

Each policy consists of a partition of the set of states x_k into two regions:

Stop region, where we stop

Continue region, where we continue

[Figure: state space split into a STOP region and a CONTINUE region, with a transition into the absorbing stop state.]

  • Slide 56/261

    EXAMPLE: ASSET SELLING

A person has an asset, and at k = 0, 1, ..., N-1 receives a random offer w_k

May accept w_k and invest the money at a fixed rate of interest r, or reject w_k and wait for w_{k+1}. Must accept the last offer w_{N-1}

DP algorithm (x_k: current offer, T: stop state):

J_N(x_N) = x_N if x_N ≠ T, and 0 if x_N = T,

J_k(x_k) = max[ (1 + r)^{N-k} x_k, E{ J_{k+1}(w_k) } ] if x_k ≠ T, and 0 if x_k = T

Optimal policy:

accept the offer x_k if x_k > α_k,

reject the offer x_k if x_k < α_k,

where

α_k = E{ J_{k+1}(w_k) } / (1 + r)^{N-k}.
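The thresholds can be computed backward directly from the offer distribution. A small sketch (my illustration, assuming i.i.d. discrete offers given as values with probabilities):

```python
def asset_selling_thresholds(offers, probs, r, N):
    """Compute alpha_k = E{J_{k+1}(w)} / (1+r)^(N-k), k = 0..N-1."""
    EJ = sum(q * w for w, q in zip(offers, probs))   # E{J_N(w)} = E{w}
    alphas = [None] * N
    for k in range(N - 1, -1, -1):
        alphas[k] = EJ / (1 + r) ** (N - k)
        # E{J_k(w)} = E{ max( (1+r)^(N-k) * w, E{J_{k+1}(w)} ) }
        EJ = sum(q * max((1 + r) ** (N - k) * w, EJ)
                 for w, q in zip(offers, probs))
    return alphas
```

By the monotonicity argument on the next slide, the returned thresholds are nonincreasing in k.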

  • Slide 57/261

    FURTHER ANALYSIS

[Figure: thresholds α_1, α_2, ..., α_{N-1} plotted over k = 0, 1, 2, ..., N-1, N; ACCEPT region above the threshold curve, REJECT region below.]

Can show that α_k ≥ α_{k+1} for all k

Proof: Let V_k(x_k) = J_k(x_k)/(1 + r)^{N-k} for x_k ≠ T. Then the DP algorithm is V_N(x_N) = x_N and

V_k(x_k) = max[ x_k, (1 + r)^{-1} E_w{ V_{k+1}(w) } ].

We have α_k = E_w{ V_{k+1}(w) }/(1 + r), so it is enough to show that V_k(x) ≥ V_{k+1}(x) for all x and k. Start with V_{N-1}(x) ≥ V_N(x) and use the monotonicity property of DP.

We can also show that α_k → ā as k → -∞. This suggests that for an infinite horizon the optimal policy is stationary.

  • Slide 58/261

    GENERAL STOPPING PROBLEMS

At time k, we may stop at cost t(x_k) or choose a control u_k ∈ U(x_k) and continue:

J_N(x_N) = t(x_N),

J_k(x_k) = min[ t(x_k), min_{u_k∈U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]

Optimal to stop at time k for states x in the set

T_k = { x | t(x) ≤ min_{u∈U(x)} E{ g(x, u, w) + J_{k+1}(f(x, u, w)) } }

Since J_{N-1}(x) ≤ J_N(x), we have J_k(x) ≤ J_{k+1}(x) for all k, so

T_0 ⊂ · · · ⊂ T_k ⊂ T_{k+1} ⊂ · · · ⊂ T_{N-1}.

The interesting case is when all the T_k are equal (to T_{N-1}, the set where it is better to stop than to go one step and stop). This can be shown to be true if

f(x, u, w) ∈ T_{N-1}, for all x ∈ T_{N-1}, u ∈ U(x), and w.

  • Slide 59/261

    SCHEDULING PROBLEMS

Set of tasks to perform; the ordering is subject to optimal choice.

Costs depend on the order

There may be stochastic uncertainty, and precedence and resource availability constraints

Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)

Some special problems admit a simple quasi-analytical solution method:

The optimal policy has an index form, i.e., each task has an easily calculable index, and it is optimal to select the task that has the maximum value of index (multi-armed bandit problems, to be discussed later)

Some problems can be solved by an interchange argument (start with some schedule, interchange two adjacent tasks, and see what happens)

  • Slide 60/261

    EXAMPLE: THE QUIZ PROBLEM

Given a list of N questions. If question i is answered correctly (with given probability p_i), we receive reward R_i; if not, the quiz terminates. Choose the order of questions to maximize expected reward.

Let i and j be the kth and (k+1)st questions in an optimally ordered list

L = (i_0, ..., i_{k-1}, i, j, i_{k+2}, ..., i_{N-1})

E{reward of L} = E{ reward of {i_0, ..., i_{k-1}} } + p_{i_0} · · · p_{i_{k-1}} ( p_i R_i + p_i p_j R_j ) + p_{i_0} · · · p_{i_{k-1}} p_i p_j E{ reward of {i_{k+2}, ..., i_{N-1}} }

Consider the list with i and j interchanged:

L' = (i_0, ..., i_{k-1}, j, i, i_{k+2}, ..., i_{N-1})

Since L is optimal, E{reward of L} ≥ E{reward of L'}, so it follows that p_i R_i + p_i p_j R_j ≥ p_j R_j + p_j p_i R_i, or

p_i R_i / (1 - p_i) ≥ p_j R_j / (1 - p_j).
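The resulting index rule (order the questions in decreasing p_i R_i/(1 - p_i)) is easy to check against brute force on a small instance. A sketch with made-up data (assuming every p_i < 1):

```python
from itertools import permutations

def expected_reward(order, p, R):
    """E{reward}: question i pays R[i] w.p. p[i]; a miss ends the quiz."""
    total, alive = 0.0, 1.0
    for i in order:
        total += alive * p[i] * R[i]      # reach question i w.p. `alive`
        alive *= p[i]
    return total

def index_order(p, R):
    """Interchange-argument rule: decreasing p_i R_i / (1 - p_i)."""
    return sorted(range(len(p)), key=lambda i: -p[i] * R[i] / (1 - p[i]))

p, R = [0.9, 0.5, 0.7], [1.0, 5.0, 2.0]   # hypothetical data
best = max(permutations(range(len(p))), key=lambda o: expected_reward(o, p, R))
assert abs(expected_reward(best, p, R)
           - expected_reward(index_order(p, R), p, R)) < 1e-12
```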

  • Slide 61/261

    MINIMAX CONTROL

Consider the basic problem with the difference that the disturbance w_k, instead of being random, is just known to belong to a given set W_k(x_k, u_k).

Find a policy π that minimizes the cost

J_π(x_0) = max_{w_k∈W_k(x_k, μ_k(x_k)), k=0,1,...,N-1} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) ]

The DP algorithm takes the form

J_N(x_N) = g_N(x_N),

J_k(x_k) = min_{u_k∈U(x_k)} max_{w_k∈W_k(x_k, u_k)} [ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]

(Exercise 1.5 in the text; solution posted on the web).

  • Slide 62/261

    UNKNOWN-BUT-BOUNDED CONTROL

For each k, keep the state x_k of the controlled system

x_{k+1} = f_k(x_k, μ_k(x_k), w_k)

inside a given set X_k, the target set at time k.

This is a minimax control problem, where the cost at stage k is

g_k(x_k) = 0 if x_k ∈ X_k, and 1 if x_k ∉ X_k.

We must reach at time k the set

X̄_k = { x_k | J_k(x_k) = 0 }

in order to be able to maintain the state within the subsequent target sets. Start with X̄_N = X_N, and for k = 0, 1, ..., N-1,

X̄_k = { x_k ∈ X_k | there exists u_k ∈ U_k(x_k) such that f_k(x_k, u_k, w_k) ∈ X̄_{k+1}, for all w_k ∈ W_k(x_k, u_k) }

  • Slide 63/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 7

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

Examples

Connection with the calculus of variations

The Hamilton-Jacobi-Bellman equation as a continuous-time limit of the DP algorithm

The Hamilton-Jacobi-Bellman equation as a sufficient condition

    Examples

  • Slide 64/261

    PROBLEM FORMULATION

We have a continuous-time dynamic system

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given,

where

x(t) ∈ ℝ^n is the state vector at time t

u(t) ∈ U ⊂ ℝ^m is the control vector at time t, U is the control constraint set

T is the terminal time.

Any admissible control trajectory {u(t) | t ∈ [0, T]} (piecewise continuous function {u(t) | t ∈ [0, T]} with u(t) ∈ U for all t ∈ [0, T]) uniquely determines {x(t) | t ∈ [0, T]}.

Find an admissible control trajectory {u(t) | t ∈ [0, T]} and corresponding state trajectory {x(t) | t ∈ [0, T]} that minimize a cost function of the form

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

f, h, g are assumed continuously differentiable.

  • Slide 65/261

    EXAMPLE I

Motion control: A unit mass moves on a line under the influence of a force u.

x(t) = (x_1(t), x_2(t)): position and velocity of the mass at time t

Problem: From a given (x_1(0), x_2(0)), bring the mass near a given final position-velocity pair (x̄_1, x̄_2) at time T, in the sense:

minimize |x_1(T) - x̄_1|² + |x_2(T) - x̄_2|²

subject to the control constraint

|u(t)| ≤ 1, for all t ∈ [0, T].

The problem fits the framework with

ẋ_1(t) = x_2(t), ẋ_2(t) = u(t),

h(x(T)) = |x_1(T) - x̄_1|² + |x_2(T) - x̄_2|²,

g(x(t), u(t)) = 0, for all t ∈ [0, T].

  • Slide 66/261

    EXAMPLE II

A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 - u(t) to production of a storable good. Thus x(t) evolves according to

ẋ(t) = γ u(t) x(t),

where γ > 0 is a given constant.

The producer wants to maximize the total amount of product stored

∫_0^T (1 - u(t)) x(t) dt

subject to

0 ≤ u(t) ≤ 1, for all t ∈ [0, T].

The initial production rate x(0) is a given positive number.

  • Slide 67/261

    EXAMPLE III (CALCULUS OF VARIATIONS)

[Figure: a curve x(t) from a given point (0, α) to a given vertical line at t = T, with length ∫_0^T √(1 + (u(t))²) dt and ẋ(t) = u(t).]

Find a curve from a given point to a given line that has minimum length.

The problem is

minimize ∫_0^T √(1 + (ẋ(t))²) dt

subject to x(0) = α.

Reformulation as an optimal control problem:

minimize ∫_0^T √(1 + (u(t))²) dt

subject to ẋ(t) = u(t), x(0) = α.

  • Slide 68/261

    HAMILTON-JACOBI-BELLMAN EQUATION I

We discretize [0, T] at times 0, δ, 2δ, ..., Nδ, where δ = T/N, and we let

x_k = x(kδ), u_k = u(kδ), k = 0, 1, ..., N.

We also discretize the system and cost:

x_{k+1} = x_k + f(x_k, u_k) · δ, h(x_N) + Σ_{k=0}^{N-1} g(x_k, u_k) · δ.

We write the DP algorithm for the discretized problem:

J̃(Nδ, x) = h(x),

J̃(kδ, x) = min_{u∈U} [ g(x, u) · δ + J̃((k+1) · δ, x + f(x, u) · δ) ]

Assume J̃ is differentiable and Taylor-expand:

J̃(kδ, x) = min_{u∈U} [ g(x, u) · δ + J̃(kδ, x) + ∇_t J̃(kδ, x) · δ + ∇_x J̃(kδ, x)' f(x, u) · δ + o(δ) ].

  • Slide 69/261

    HAMILTON-JACOBI-BELLMAN EQUATION II

Let J*(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,

lim_{k→∞, δ→0, kδ=t} J̃(kδ, x) = J*(t, x), for all t, x,

we obtain, for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

This is the Hamilton-Jacobi-Bellman (HJB) equation, a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J*(t, x) (assuming J* is differentiable and the preceding informal limiting procedure is valid).

It is hard to tell a priori if J*(t, x) is differentiable. So we use the HJB equation as a verification tool; if we can solve it for a differentiable J*(t, x), then:

J* is the optimal cost-to-go function

The control μ*(t, x) that minimizes in the RHS for each (t, x) defines an optimal control

  • Slide 70/261

    VERIFICATION/SUFFICIENCY THEOREM

Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that, for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇_t V(t, x) + ∇_x V(t, x)' f(x, u) ],

V(T, x) = h(x), for all x.

Suppose also that μ*(t, x) attains the minimum above for all t and x.

Let {x*(t) | t ∈ [0, T]} and u*(t) = μ*(t, x*(t)), t ∈ [0, T], be the corresponding state and control trajectories.

Then

V(t, x) = J*(t, x), for all t, x,

and {u*(t) | t ∈ [0, T]} is optimal.

  • Slide 71/261

    PROOF

Let {(û(t), x̂(t)) | t ∈ [0, T]} be any admissible control-state trajectory. We have, for all t ∈ [0, T],

0 ≤ g(x̂(t), û(t)) + ∇_t V(t, x̂(t)) + ∇_x V(t, x̂(t))' f(x̂(t), û(t))

Using the system equation (d/dt) x̂(t) = f(x̂(t), û(t)), the RHS of the above is equal to

g(x̂(t), û(t)) + (d/dt) V(t, x̂(t))

Integrating this expression over t ∈ [0, T],

0 ≤ ∫_0^T g(x̂(t), û(t)) dt + V(T, x̂(T)) - V(0, x̂(0)).

Using V(T, x) = h(x) and x̂(0) = x(0), we have

V(0, x(0)) ≤ h(x̂(T)) + ∫_0^T g(x̂(t), û(t)) dt.

If we use u*(t) and x*(t) in place of û(t) and x̂(t), the inequalities become equalities, and

V(0, x(0)) = h(x*(T)) + ∫_0^T g(x*(t), u*(t)) dt.

  • Slide 72/261

    EXAMPLE OF THE HJB EQUATION

Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)(x(T))². The HJB equation is

0 = min_{|u|≤1} [ ∇_t V(t, x) + ∇_x V(t, x) u ], for all t, x,

with the terminal condition V(T, x) = (1/2)x².

Evident candidate for optimality: μ*(t, x) = -sgn(x). Corresponding cost-to-go:

J*(t, x) = (1/2) ( max{0, |x| - (T - t)} )².

We verify that J* solves the HJB equation, and that u = -sgn(x) attains the min in the RHS. Indeed,

∇_t J*(t, x) = max{0, |x| - (T - t)},

∇_x J*(t, x) = sgn(x) max{0, |x| - (T - t)}.

Substituting, the HJB equation becomes

0 = min_{|u|≤1} [ 1 + sgn(x) u ] max{0, |x| - (T - t)}

  • Slide 73/261

    LINEAR QUADRATIC PROBLEM

Consider the n-dimensional linear system

ẋ(t) = A x(t) + B u(t),

and the quadratic cost

x(T)' Q_T x(T) + ∫_0^T ( x(t)' Q x(t) + u(t)' R u(t) ) dt

The HJB equation is

0 = min_{u∈ℝ^m} [ x'Qx + u'Ru + ∇_t V(t, x) + ∇_x V(t, x)' (Ax + Bu) ],

with the terminal condition V(T, x) = x' Q_T x. We try a solution of the form

V(t, x) = x' K(t) x, K(t): n × n symmetric,

and show that V(t, x) solves the HJB equation if

K̇(t) = -K(t)A - A'K(t) + K(t)BR^{-1}B'K(t) - Q

with the terminal condition K(T) = Q_T.
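The matrix Riccati ODE can be integrated backward from K(T) = Q_T. A minimal sketch using explicit Euler (my illustration; the step count and the integration scheme are arbitrary choices):

```python
import numpy as np

def riccati_ode_backward(A, B, Q, R, QT, T, steps=1000):
    """Integrate dK/dt = -KA - A'K + K B R^{-1} B' K - Q
    backward from K(T) = Q_T; returns an approximation of K(0)."""
    K = QT.astype(float).copy()
    dt = T / steps
    Rinv = np.linalg.inv(R)
    for _ in range(steps):
        Kdot = -K @ A - A.T @ K + K @ B @ Rinv @ B.T @ K - Q
        K = K - dt * Kdot                 # step backward in time
    return K
```

Minimizing the RHS of the HJB equation then gives the feedback u(t) = -R^{-1} B' K(t) x(t).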

  • Slide 74/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 8

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

From the HJB equation to the Pontryagin Minimum Principle

    Examples

  • Slide 75/261

    THE HJB EQUATION

Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given

Cost function:

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

J*(t, x): optimal cost-to-go from x at time t

HJB equation: For all (t, x),

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

Verification theorem: If we can find a solution, it must be equal to the optimal cost-to-go function.

Also, a (closed-loop) policy μ*(t, x) such that μ*(t, x) attains the min for each (t, x) is optimal.

  • Slide 76/261

    HJB EQ. ALONG AN OPTIMAL TRAJECTORY

Observation I: An optimal control-state trajectory pair {(u*(t), x*(t)) | t ∈ [0, T]} satisfies, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} [ g(x*(t), u) + ∇_x J*(t, x*(t))' f(x*(t), u) ]   (*)

Observation II: To obtain an optimal control trajectory {u*(t) | t ∈ [0, T]} via this equation, we don't need to know ∇_x J*(t, x) for all (t, x), only the time function

p(t) = ∇_x J*(t, x*(t)), t ∈ [0, T].

It turns out that calculating p(t) is often easier than calculating J*(t, x) or ∇_x J*(t, x) for all (t, x).

Pontryagin's minimum principle is just Eq. (*) together with an equation for calculating p(t), called the adjoint equation.

Also, Pontryagin's minimum principle is valid much more generally, even in cases where J*(t, x) is not differentiable and the HJB equation has no solution.

  • Slide 77/261

    DERIVING THE ADJOINT EQUATION

The HJB equation holds as an identity for all (t, x), so it can be differentiated [the gradient of the RHS with respect to (t, x) is identically 0].

We need a tool for differentiation of minimum functions.

Lemma: Let F(t, x, u) be a continuously differentiable function of t ∈ ℝ, x ∈ ℝ^n, and u ∈ ℝ^m, and let U be a convex subset of ℝ^m. Assume that μ*(t, x) is a continuously differentiable function such that

μ*(t, x) = arg min_{u∈U} F(t, x, u), for all t, x.

Then

∇_t { min_{u∈U} F(t, x, u) } = ∇_t F(t, x, μ*(t, x)), for all t, x,

∇_x { min_{u∈U} F(t, x, u) } = ∇_x F(t, x, μ*(t, x)), for all t, x.

  • Slide 78/261

    DIFFERENTIATING THE HJB EQUATION I

We set to zero the gradient with respect to x and t of the function

g(x, μ*(t, x)) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, μ*(t, x)),

and we rely on the Lemma to disregard the terms involving the derivatives of μ*(t, x) with respect to t and x.

We obtain, for all (t, x),

0 = ∇_x g(x, μ*(t, x)) + ∇²_xt J*(t, x) + ∇²_xx J*(t, x) f(x, μ*(t, x)) + ∇_x f(x, μ*(t, x)) ∇_x J*(t, x),

0 = ∇²_tt J*(t, x) + ∇²_xt J*(t, x)' f(x, μ*(t, x)),

where ∇_x f(x, μ*(t, x)) is the matrix

∇_x f = [ ∂f_1/∂x_1 · · · ∂f_n/∂x_1 ; · · · ; ∂f_1/∂x_n · · · ∂f_n/∂x_n ]

  • Slide 79/261

    DIFFERENTIATING THE HJB EQUATION II

The preceding equations hold for all (t, x). We specialize them along an optimal state and control trajectory {(x*(t), u*(t)) | t ∈ [0, T]}, where u*(t) = μ*(t, x*(t)) for all t ∈ [0, T].

We have ẋ*(t) = f(x*(t), u*(t)), so the terms

∇²_xt J*(t, x*(t)) + ∇²_xx J*(t, x*(t)) f(x*(t), u*(t)),

∇²_tt J*(t, x*(t)) + ∇²_xt J*(t, x*(t))' f(x*(t), u*(t))

are equal to the total derivatives

(d/dt) ∇_x J*(t, x*(t)), (d/dt) ∇_t J*(t, x*(t)),

and we have

0 = ∇_x g(x*(t), u*(t)) + (d/dt) ∇_x J*(t, x*(t)) + ∇_x f(x*(t), u*(t)) ∇_x J*(t, x*(t)),

0 = (d/dt) ∇_t J*(t, x*(t)).

  • Slide 80/261

    CONCLUSION FROM DIFFERENTIATING THE HJB

Define

p(t) = ∇_x J*(t, x*(t)) and p_0(t) = ∇_t J*(t, x*(t))

We have the adjoint equation

ṗ(t) = -∇_x f(x*(t), u*(t)) p(t) - ∇_x g(x*(t), u*(t))

and

ṗ_0(t) = 0,

or equivalently,

p_0(t) = constant, for all t ∈ [0, T].

Note also that, by definition, J*(T, x*(T)) = h(x*(T)), so we have the following boundary condition at the terminal time:

p(T) = ∇h(x*(T))

  • Slide 81/261

    NOTATIONAL SIMPLIFICATION

Define the Hamiltonian function

H(x, u, p) = g(x, u) + p' f(x, u)

The adjoint equation becomes

ṗ(t) = -∇_x H(x*(t), u*(t), p(t))

The HJB equation becomes

0 = min_{u∈U} H(x*(t), u, p(t)) + p_0(t) = H(x*(t), u*(t), p(t)) + p_0(t),

so since p_0(t) = constant, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 82/261

    PONTRYAGIN MINIMUM PRINCIPLE

The preceding (highly informal) derivation is summarized as follows:

Minimum Principle: Let {u*(t) | t ∈ [0, T]} be an optimal control trajectory and let {x*(t) | t ∈ [0, T]} be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with the boundary condition

p(T) = ∇h(x*(T)).

Then, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

Furthermore, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 83/261

    2-POINT BOUNDARY PROBLEM VIEW

The minimum principle is a necessary condition for optimality and can be used to identify candidates for optimality.

We need to solve for x*(t) and p(t) the differential equations

ẋ*(t) = f(x*(t), u*(t)),

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with split boundary conditions:

x*(0): given, p(T) = ∇h(x*(T)).

The control trajectory is implicitly determined from x*(t) and p(t) via the equation

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

This 2-point boundary value problem can be addressed with a variety of numerical methods.
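One such method is simple shooting: guess p(0), integrate the coupled state/costate equations forward, and adjust the guess until p(T) = ∇h(x(T)) holds. A rough sketch of the forward pass (my illustration; f, dH_dx, and the Hamiltonian minimizer argmin_H are assumed supplied, and Euler integration is used for brevity):

```python
import numpy as np

def shoot(p0, x0, f, dH_dx, argmin_H, T, steps=200):
    """Integrate xdot = f(x, u*), pdot = -grad_x H forward from a guessed
    p(0); a root-finder on p0 -> p(T) - grad h(x(T)) wraps this."""
    dt = T / steps
    x, p = np.asarray(x0, float), np.asarray(p0, float)
    for _ in range(steps):
        u = argmin_H(x, p)                # u*(t) minimizes the Hamiltonian
        x = x + dt * f(x, u)
        p = p - dt * dH_dx(x, u, p)
    return x, p
```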

  • Slide 84/261

    ANALYTICAL EXAMPLE I

minimize ∫_0^T √(1 + (u(t))²) dt

subject to

ẋ(t) = u(t), x(0) = α.

The Hamiltonian is

H(x, u, p) = √(1 + u²) + pu,

and the adjoint equation is ṗ(t) = 0 with p(T) = 0.

Hence, p(t) = 0 for all t ∈ [0, T], so minimization of the Hamiltonian gives

u*(t) = arg min_u √(1 + u²) = 0, for all t ∈ [0, T].

Therefore, ẋ*(t) = 0 for all t, implying that x*(t) is constant. Using the initial condition x*(0) = α, it follows that x*(t) = α for all t.

  • Slide 85/261

    ANALYTICAL EXAMPLE II

Optimal production problem:

maximize ∫_0^T (1 - u(t)) x(t) dt

subject to 0 ≤ u(t) ≤ 1 for all t, and

ẋ(t) = γ u(t) x(t), x(0) > 0: given.

Hamiltonian: H(x, u, p) = (1 - u)x + p γ u x.

The adjoint equation is

ṗ(t) = -γ u*(t) p(t) - 1 + u*(t), p(T) = 0.

Maximization of the Hamiltonian over u ∈ [0, 1]:

u*(t) = 0 if p(t) < 1/γ, and 1 if p(t) ≥ 1/γ.

Since p(T) = 0, for t close to T we have p(t) < 1/γ and u*(t) = 0. Therefore, for t near T, the adjoint equation has the form ṗ(t) = -1.

  • Slide 86/261

    ANALYTICAL EXAMPLE II (CONTINUED)

[Figure: p(t) increasing linearly (backward from p(T) = 0) and reaching 1/γ at t = T - 1/γ.]

At t = T - 1/γ, p(t) is equal to 1/γ, so u*(t) changes to u*(t) = 1.

Geometrical construction:

[Figure: p(t) over [0, T] together with the optimal control, u*(t) = 1 for t < T - 1/γ and u*(t) = 0 for t > T - 1/γ.]

  • Slide 87/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 9

    LECTURE OUTLINE

    Deterministic continuous-time optimal control

Variants of the Pontryagin Minimum Principle

Fixed terminal state

Free terminal time

    Examples

    Discrete-Time Minimum Principle

  • Slide 88/261

    REVIEW

Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)), 0 ≤ t ≤ T, x(0): given

Cost function:

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

J*(t, x): optimal cost-to-go from x at time t

HJB equation/verification theorem: For all (t, x),

0 = min_{u∈U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]

with the boundary condition J*(T, x) = h(x).

Adjoint equation/vector: To compute an optimal state-control trajectory {(u*(t), x*(t))}, it is enough to know

p(t) = ∇_x J*(t, x*(t)), t ∈ [0, T].

The Pontryagin theorem gives an equation for p(t).

  • Slide 89/261

    NEC. CONDITION: PONTRYAGIN MIN. PRINCIPLE

Define the Hamiltonian function

H(x, u, p) = g(x, u) + p' f(x, u).

Minimum Principle: Let {u*(t) | t ∈ [0, T]} be an optimal control trajectory and let {x*(t) | t ∈ [0, T]} be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation

ṗ(t) = -∇_x H(x*(t), u*(t), p(t)),

with the boundary condition

p(T) = ∇h(x*(T)).

Then, for all t ∈ [0, T],

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)).

Furthermore, there is a constant C such that

H(x*(t), u*(t), p(t)) = C, for all t ∈ [0, T].

  • Slide 90/261

    VARIATIONS: FIXED TERMINAL STATE

Suppose that, in addition to the initial state x(0), the final state x(T) is given.

Then the informal derivation of the adjoint equation still holds, but the terminal condition J*(T, x) ≡ h(x) of the HJB equation is not true anymore.

In effect,

J*(T, x) = 0 if x = x(T), and ∞ otherwise.

So J*(T, x) cannot be differentiated with respect to x, and the terminal boundary condition p(T) = ∇h(x*(T)) for the adjoint equation does not hold.

As compensation, we have the extra condition

x(T): given,

thus maintaining the balance between boundary conditions and unknowns.

Generalization: Some components of the terminal state are fixed.

  • Slide 91/261

    EXAMPLE WITH FIXED TERMINAL STATE

Consider finding the curve of minimum length connecting two points (0, α) and (T, β). We have

ẋ(t) = u(t), x(0) = α, x(T) = β,

and the cost is ∫_0^T √(1 + (u(t))²) dt.

[Figure: the optimal trajectory x*(t), a straight line from (0, α) to (T, β).]

The adjoint equation is ṗ(t) = 0, implying that

p(t) = constant, for all t ∈ [0, T].

Minimizing the Hamiltonian √(1 + u²) + p(t)u:

u*(t) = constant, for all t ∈ [0, T].

So the optimal {x*(t) | t ∈ [0, T]} is a straight line.

  • Slide 92/261

    VARIATIONS: FREE TERMINAL TIME

The initial state and/or the terminal state are given, but the terminal time T is subject to optimization.

Let {(x*(t), u*(t)) | t ∈ [0, T]} be an optimal state-control trajectory pair and let T* be the optimal terminal time. Then x*(t), u*(t) would still be optimal if T were fixed at T*, so

u*(t) = arg min_{u∈U} H(x*(t), u, p(t)), for all t ∈ [0, T*],

where p(t) is given by the adjoint equation.

In addition: H(x*(t), u*(t), p(t)) = 0 for all t [instead of H(x*(t), u*(t), p(t)) ≡ constant].

Justification: We have

∇_t J*(t, x*(t)) |_{t=0} = 0

Along the optimal trajectory, the HJB equation gives

∇_t J*(t, x*(t)) = -H(x*(t), u*(t), p(t)), for all t,

so H(x*(0), u*(0), p(0)) = 0.

  • Slide 93/261

    MINIMUM-TIME EXAMPLE I

A unit mass moves horizontally: ÿ(t) = u(t), where y(t): position, u(t): force, u(t) ∈ [-1, 1].

Given the initial position-velocity pair (y(0), ẏ(0)), bring the object to (y(T), ẏ(T)) = (0, 0) so that the time of transfer is minimum. Thus, we want to

minimize T = ∫_0^T 1 dt.

Let the state variables be

x_1(t) = y(t), x_2(t) = ẏ(t),

so the system equation is

ẋ_1(t) = x_2(t), ẋ_2(t) = u(t).

Initial state (x_1(0), x_2(0)): given, and

x_1(T) = 0, x_2(T) = 0.

  • Slide 94/261

    MINIMUM-TIME EXAMPLE II

If {u*(t) | t ∈ [0, T]} is optimal, u*(t) must minimize the Hamiltonian for each t, i.e.,

u*(t) = arg min_{-1≤u≤1} [ 1 + p_1(t) x_2(t) + p_2(t) u ].

Therefore

u*(t) = 1 if p_2(t) < 0, and u*(t) = -1 if p_2(t) > 0

  • Slide 95/261

  • Slide 96/261

    MINIMUM-TIME EXAMPLE IV

For intervals where u*(t) ≡ 1, the system moves along the curves on which

x_1(t) - (1/2)(x_2(t))² is constant.

For intervals where u*(t) ≡ -1, the system moves along the curves on which

x_1(t) + (1/2)(x_2(t))² is constant.

[Figure: (a) trajectories in the (x_1, x_2) plane for u(t) ≡ 1; (b) trajectories for u(t) ≡ -1.]

  • Slide 97/261

    MINIMUM-TIME EXAMPLE V

To bring the system from the initial state x(0) to the origin with at most one switch, we use the following switching curve.

[Figure: switching curve through the origin in the (x_1, x_2) plane, with u*(t) ≡ 1 used below the curve and u*(t) ≡ -1 above it.]

(a) If the initial state lies above the switching curve, use u*(t) ≡ -1 until the state hits the switching curve; then use u*(t) ≡ 1.

(b) If the initial state lies below the switching curve, use u*(t) ≡ 1 until the state hits the switching curve; then use u*(t) ≡ -1.

(c) If the initial state lies on the top (bottom) part of the switching curve, use u*(t) ≡ -1 [u*(t) ≡ 1, respectively].

  • Slide 98/261

    DISCRETE-TIME MINIMUM PRINCIPLE

Minimize J(u) = g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k), subject to u_k ∈ U_k ⊂ ℝ^m, with U_k: convex, and

x_{k+1} = f_k(x_k, u_k), k = 0, ..., N-1, x_0: given.

Introduce the Hamiltonian function

H_k(x_k, u_k, p_{k+1}) = g_k(x_k, u_k) + p_{k+1}' f_k(x_k, u_k)

Suppose {(u*_k, x*_{k+1}) | k = 0, ..., N-1} are optimal. Then for all k,

∇_{u_k} H_k(x*_k, u*_k, p_{k+1})' (u_k - u*_k) ≥ 0, for all u_k ∈ U_k,

where p_1, ..., p_N are obtained from

p_k = ∇_{x_k} f_k p_{k+1} + ∇_{x_k} g_k,

with the terminal condition p_N = ∇g_N(x_N).

If, in addition, the Hamiltonian H_k is a convex function of u_k for any fixed x_k and p_{k+1}, we have

u*_k = arg min_{u_k∈U_k} H_k(x*_k, u_k, p_{k+1}), for all k.
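The adjoint recursion gives an O(N) way to evaluate all partial gradients ∇_{u_k} J, which is the content of the derivation on the next slide. A sketch (my illustration; fx and fu denote the standard Jacobians ∂f/∂x and ∂f/∂u, and gx, gu, gN_grad the corresponding gradients):

```python
import numpy as np

def gradient_via_adjoint(us, x0, f, fx, fu, gx, gu, gN_grad):
    """Forward pass for x_k, then p_k = fx' p_{k+1} + gx with
    p_N = grad g_N(x_N); finally grad_{u_k} J = fu' p_{k+1} + gu."""
    N = len(us)
    xs = [np.asarray(x0, float)]
    for k in range(N):                    # forward state pass
        xs.append(f(xs[k], us[k]))
    p = gN_grad(xs[N])                    # terminal condition
    grads = [None] * N
    for k in range(N - 1, -1, -1):        # backward costate pass
        grads[k] = fu(xs[k], us[k]).T @ p + gu(xs[k], us[k])
        p = fx(xs[k], us[k]).T @ p + gx(xs[k], us[k])
    return grads
```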

  • Slide 99/261

    DERIVATION

We develop an expression for the gradient ∇J(u). We have, using the chain rule,

∇_{u_k} J(u) = ∇_{u_k} f_k · ∇_{x_{k+1}} f_{k+1} · · · ∇_{x_{N-1}} f_{N-1} · ∇g_N
+ ∇_{u_k} f_k · ∇_{x_{k+1}} f_{k+1} · · · ∇_{x_{N-2}} f_{N-2} · ∇_{x_{N-1}} g_{N-1}
+ · · ·
+ ∇_{u_k} f_k · ∇_{x_{k+1}} g_{k+1}
+ ∇_{u_k} g_k,

where all gradients are evaluated along u and the corresponding state trajectory.

Introduce the discrete-time adjoint equation

p_k = ∇_{x_k} f_k p_{k+1} + ∇_{x_k} g_k, k = 1, ..., N-1,

with terminal condition p_N = ∇g_N.

Verify that, for all k,

∇_{u_k} J(u_0, ..., u_{N-1}) = ∇_{u_k} H_k(x_k, u_k, p_{k+1})

  • Slide 100/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 10

    LECTURE OUTLINE

    Problems with imperfect state info

Reduction to the perfect state info case

Machine repair example

  • Slide 101/261

    BASIC PROBLEM WITH IMPERFECT STATE INFO

Same as the basic problem of Chapter 1 with one difference: the controller, instead of knowing x_k, receives at each time k an observation of the form

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

The observation z_k belongs to some space Z_k. The random observation disturbance v_k is characterized by a probability distribution

P_{v_k}(· | x_k, ..., x_0, u_{k-1}, ..., u_0, w_{k-1}, ..., w_0, v_{k-1}, ..., v_0)

The initial state x_0 is also random and characterized by a probability distribution P_{x_0}.

The probability distribution P_{w_k}(· | x_k, u_k) of w_k is given, and it may depend explicitly on x_k and u_k but not on w_0, ..., w_{k-1}, v_0, ..., v_{k-1}.

The control u_k is constrained to a given subset U_k (this subset does not depend on x_k, which is not assumed known).

  • Slide 102/261

    INFORMATION VECTOR AND POLICIES

Denote by I_k the information vector, i.e., the information available at time k:

I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1, I_0 = z_0.

We consider policies π = {μ_0, μ_1, ..., μ_{N-1}}, where each function μ_k maps the information vector I_k into a control u_k and

μ_k(I_k) ∈ U_k, for all I_k, k ≥ 0.

We want to find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1

  • Slide 103/261

    EXAMPLE: MULTIACCESS COMMUNICATION I

A collection of transmitting stations sharing a common channel are synchronized to transmit packets of data at integer times.

x_k: backlog at the beginning of slot k

a_k: random number of packet arrivals in slot k

t_k: number of packets transmitted in slot k

x_{k+1} = x_k + a_k - t_k

At the kth slot, each of the x_k packets in the system is transmitted with probability u_k (common for all packets). If two or more packets are transmitted simultaneously, they collide.

So t_k = 1 (a success) with probability x_k u_k (1 - u_k)^{x_k - 1}, and t_k = 0 (idle or collision) otherwise.

Imperfect state info: The stations can observe the channel and determine whether in any one slot there was a collision (two or more packets), a success (one packet), or an idle (no packets).

  • Slide 104/261

    EXAMPLE: MULTIACCESS COMMUNICATION II

Information vector at time k: the entire history (up to k) of successes, idles, and collisions (as well as u_0, u_1, ..., u_{k-1}). Mathematically, z_{k+1}, the observation at the end of the kth slot, is

z_{k+1} = v_{k+1},

where v_{k+1} yields an idle with probability (1 - u_k)^{x_k}, a success with probability x_k u_k (1 - u_k)^{x_k - 1}, and a collision otherwise.

If we had perfect state information, the DP algorithm would be

J_k(x_k) = g_k(x_k) + min_{0≤u_k≤1} E_{a_k} { p(x_k, u_k) J_{k+1}(x_k + a_k - 1) + ( 1 - p(x_k, u_k) ) J_{k+1}(x_k + a_k) },

where p(x_k, u_k) is the success probability x_k u_k (1 - u_k)^{x_k - 1}.

The optimal (perfect state information) policy would be to select the value of u_k that maximizes p(x_k, u_k), so

μ_k(x_k) = 1/x_k, for all x_k ≥ 1.

  • Slide 105/261

    REFORMULATION AS A PERFECT INFO PROBLEM

We have

I_{k+1} = (I_k, z_{k+1}, u_k), k = 0, 1, ..., N-2, I_0 = z_0.

View this as a dynamic system with state I_k, control u_k, and random disturbance z_{k+1}.

We have

P(z_{k+1} | I_k, u_k) = P(z_{k+1} | I_k, u_k, z_0, z_1, ..., z_k),

since z_0, z_1, ..., z_k are part of the information vector I_k. Thus the probability distribution of z_{k+1} depends explicitly only on the state I_k and control u_k, and not on the prior disturbances z_k, ..., z_0.

Write

E{ g_k(x_k, u_k, w_k) } = E{ E_{x_k, w_k}{ g_k(x_k, u_k, w_k) | I_k, u_k } },

so the cost per stage of the new system is

g̃_k(I_k, u_k) = E_{x_k, w_k}{ g_k(x_k, u_k, w_k) | I_k, u_k }

  • Slide 106/261

    DP ALGORITHM

Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

J_k(I_k) = min_{u_k∈U_k} E_{x_k, w_k, z_{k+1}} { g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k }

for k = 0, 1, ..., N-2, and for k = N-1,

J_{N-1}(I_{N-1}) = min_{u_{N-1}∈U_{N-1}} E_{x_{N-1}, w_{N-1}} { g_N(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} }

The optimal cost J* is given by

J* = E_{z_0}{ J_0(z_0) }.

  • Slide 107/261

    MACHINE REPAIR EXAMPLE I

A machine can be in one of two states, denoted P (good state) and P̄ (bad state).

At the end of each period the machine is inspected.

Two possible inspection outcomes: G (probably good state) and B (probably bad state).

[Figure: state transition and inspection diagram. From P the machine stays in P with probability 2/3 and moves to P̄ with probability 1/3; P̄ is absorbing. Inspection yields G with probability 3/4 in state P, and B with probability 3/4 in state P̄.]

Possible actions after each inspection:

C: Continue operation of the machine.

S: Stop the machine, determine its state, and if in P̄ bring it back to the good state P.

Cost per stage:

g(P, C) = 0, g(P, S) = 1, g(P̄, C) = 2, g(P̄, S) = 1

  • Slide 108/261

    MACHINE REPAIR EXAMPLE II

The information vector at times 0 and 1 is

I_0 = z_0, I_1 = (z_0, z_1, u_0),

and we seek functions μ_0(I_0), μ_1(I_1) that minimize

E_{x_0, w_0, w_1, v_0, v_1} { g(x_0, μ_0(z_0)) + g(x_1, μ_1(z_0, z_1, μ_0(z_0))) }.

DP algorithm: Start with J_2(I_2) = 0. For k = 0, 1, take the min over the two actions, C and S:

J_k(I_k) = min[ P(x_k = P | I_k) g(P, C) + P(x_k = P̄ | I_k) g(P̄, C) + E_{z_{k+1}}{ J_{k+1}(I_k, C, z_{k+1}) | I_k, C },

P(x_k = P | I_k) g(P, S) + P(x_k = P̄ | I_k) g(P̄, S) + E_{z_{k+1}}{ J_{k+1}(I_k, S, z_{k+1}) | I_k, S } ]

  • Slide 109/261

  • Slide 110/261

    MACHINE REPAIR EXAMPLE IV

(2) For I_1 = (B, G, S):

P(x_1 = P̄ | B, G, S) = P(x_1 = P̄ | G, G, S) = 1/7

J_1(B, G, S) = 2/7, μ*_1(B, G, S) = C.

(3) For I_1 = (G, B, S):

P(x_1 = P̄ | G, B, S) = P(x_1 = P̄, G, B, S) / P(G, B, S) = ( (1/3)(3/4) ) / ( (2/3)(1/4) + (1/3)(3/4) ) = 3/5,

J_1(G, B, S) = 1, μ*_1(G, B, S) = S.

Similarly, for all possible I_1, we compute J_1(I_1) and μ*_1(I_1), which is to continue (u_1 = C) if the last inspection was G, and to stop otherwise.

  • Slide 111/261

  • Slide 112/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 11

    LECTURE OUTLINE

    Review of DP for imperfect state info

Linear quadratic problems

Separation of estimation and control

  • Slide 113/261

REVIEW: PROBLEM WITH IMPERFECT STATE INFO

Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

I_k: information vector available at time k:

I_0 = z_0, I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1

Optimization over policies π = {μ_0, μ_1, ..., μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.

Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1

  • Slide 114/261

    DP ALGORITHM

Reformulate to a perfect state info problem, and write the DP algorithm:

J_k(I_k) = min_{u_k∈U_k} E_{x_k, w_k, z_{k+1}} { g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k }

for k = 0, 1, ..., N-2, and for k = N-1,

J_{N-1}(I_{N-1}) = min_{u_{N-1}∈U_{N-1}} E_{x_{N-1}, w_{N-1}} { g_N(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} }

The optimal cost J* is given by

J* = E_{z_0}{ J_0(z_0) }.

  • Slide 115/261

    LINEAR-QUADRATIC PROBLEMS

System: x_{k+1} = A_k x_k + B_k u_k + w_k

Quadratic cost:

E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + Σ_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }

where Q_k ≥ 0 and R_k > 0.

Observations:

z_k = C_k x_k + v_k, k = 0, 1, ..., N-1

w_0, ..., w_{N-1}, v_0, ..., v_{N-1}: independent, zero mean

Key fact to show:

Optimal policy {μ*_0, ..., μ*_{N-1}} is of the form

μ*_k(I_k) = L_k E{x_k | I_k}

L_k: same as for the perfect state info case

The estimation problem and the control problem can be solved separately

  • Slide 116/261

    DP ALGORITHM I

Last stage N-1 (suppressing the index N-1):

J_{N-1}(I_{N-1}) = min_{u_{N-1}} E_{x_{N-1}, w_{N-1}} { x_{N-1}' Q x_{N-1} + u_{N-1}' R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})' Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1}, u_{N-1} }

Since E{w_{N-1} | I_{N-1}} = E{w_{N-1}} = 0, the minimization involves

min_{u_{N-1}} [ u_{N-1}' (B'QB + R) u_{N-1} + 2 E{x_{N-1} | I_{N-1}}' A'QB u_{N-1} ]

The minimization yields the optimal μ*_{N-1}:

u*_{N-1} = μ*_{N-1}(I_{N-1}) = L_{N-1} E{x_{N-1} | I_{N-1}},

where

L_{N-1} = -(B'QB + R)^{-1} B'QA

  • Slide 117/261

    DP ALGORITHM II

Substituting in the DP algorithm:

J_{N-1}(I_{N-1}) = E_{x_{N-1}} { x_{N-1}' K_{N-1} x_{N-1} | I_{N-1} } + E_{x_{N-1}} { ( x_{N-1} - E{x_{N-1} | I_{N-1}} )' P_{N-1} ( x_{N-1} - E{x_{N-1} | I_{N-1}} ) | I_{N-1} } + E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} },

where the matrices K_{N-1} and P_{N-1} are given by

P_{N-1} = A_{N-1}' Q_N B_{N-1} (R_{N-1} + B_{N-1}' Q_N B_{N-1})^{-1} B_{N-1}' Q_N A_{N-1},

K_{N-1} = A_{N-1}' Q_N A_{N-1} - P_{N-1} + Q_{N-1}.

Note the structure of J_{N-1}: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

x_{N-1} - E{x_{N-1} | I_{N-1}}

  • Slide 118/261

    DP ALGORITHM III

DP equation for period N-2:

J_{N-2}(I_{N-2}) = min_{u_{N-2}} E_{x_{N-2}, w_{N-2}, z_{N-1}} { x_{N-2}' Q x_{N-2} + u_{N-2}' R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2}, u_{N-2} }

= E{ x_{N-2}' Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}' R u_{N-2} + E{ x_{N-1}' K_{N-1} x_{N-1} | I_{N-2}, u_{N-2} } ]

+ E{ ( x_{N-1} - E{x_{N-1} | I_{N-1}} )' P_{N-1} ( x_{N-1} - E{x_{N-1} | I_{N-1}} ) | I_{N-2}, u_{N-2} }

+ E_{w_{N-1}} { w_{N-1}' Q_N w_{N-1} }.

Key point: We have excluded the next-to-last term from the minimization with respect to u_{N-2}. This term turns out to be independent of u_{N-2}.

  • Slide 119/261

    QUALITY OF ESTIMATION LEMMA

For every k, there is a function M_k such that we have

x_k - E{x_k | I_k} = M_k(x_0, w_0, ..., w_{k-1}, v_0, ..., v_k),

independently of the policy being used.

The following simplified version of the lemma conveys the main idea.

Simplified Lemma: Let r, u, z be random variables such that r and u are independent, and let x = r + u. Then

x - E{x | z, u} = r - E{r | z}.

Proof: We have

x - E{x | z, u} = r + u - E{r + u | z, u} = r + u - E{r | z, u} - u = r - E{r | z, u} = r - E{r | z}.

  • Slide 120/261

    APPLYING THE QUALITY OF ESTIMATION LEMMA

Using the lemma,

x_{N-1} - E{x_{N-1} | I_{N-1}} = ξ_{N-1},

where

ξ_{N-1}: function of x_0, w_0, ..., w_{N-2}, v_0, ..., v_{N-1}

Since ξ_{N-1} is independent of u_{N-2}, the conditional expectation of ξ_{N-1}' P_{N-1} ξ_{N-1} satisfies

E{ ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2}, u_{N-2} } = E{ ξ_{N-1}' P_{N-1} ξ_{N-1} | I_{N-2} }

and is independent of u_{N-2}.

So minimization in the DP algorithm yields

u*_{N-2} = μ*_{N-2}(I_{N-2}) = L_{N-2} E{x_{N-2} | I_{N-2}}

  • Slide 121/261

    FINAL RESULT

Continuing similarly (using also the quality of estimation lemma):

μ*_k(I_k) = L_k E{x_k | I_k},

where L_k is the same as for perfect state info:

L_k = -(R_k + B_k' K_{k+1} B_k)^{-1} B_k' K_{k+1} A_k,

with K_k generated from K_N = Q_N, using

K_k = A_k' K_{k+1} A_k - P_k + Q_k,

P_k = A_k' K_{k+1} B_k (R_k + B_k' K_{k+1} B_k)^{-1} B_k' K_{k+1} A_k

[Figure: block diagram of the closed loop. System x_{k+1} = A_k x_k + B_k u_k + w_k with measurement z_k = C_k x_k + v_k; an estimator (fed by z_k and the delayed u_{k-1}) produces E{x_k | I_k}, which is multiplied by the gain L_k to give u_k.]
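In the Gaussian case the estimator is the Kalman filter. One step of the separated controller might be sketched as follows (my illustration; W and V denote the covariances of w_k and v_k, and L the gain from the Riccati recursion above):

```python
import numpy as np

def lqg_step(xhat, Sigma, z, u_prev, A, B, C, W, V, L):
    """Kalman filter time/measurement update of E{x_k | I_k},
    followed by the certainty-equivalent control u_k = L xhat."""
    xpred = A @ xhat + B @ u_prev         # time update with previous control
    Spred = A @ Sigma @ A.T + W
    S = C @ Spred @ C.T + V               # innovation covariance
    G = Spred @ C.T @ np.linalg.inv(S)    # Kalman gain
    xhat = xpred + G @ (z - C @ xpred)    # measurement update with z_k
    Sigma = (np.eye(len(xhat)) - G @ C) @ Spred
    return L @ xhat, xhat, Sigma
```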

  • Slide 122/261

    SEPARATION INTERPRETATION

The optimal controller can be decomposed into

(a) an estimator, which uses the data to generate the conditional expectation E{x_k | I_k};

(b) an actuator, which multiplies E{x_k | I_k} by the gain matrix L_k and applies the control input u_k = L_k E{x_k | I_k}.

Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error

E_x{ ||x - x̂||² | I } = E{ ||x||² | I } - 2 E{x | I}' x̂ + ||x̂||²,

is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).

The estimator portion of the optimal controller is optimal for the problem of estimating the state x_k assuming the control is not subject to choice.

The actuator portion is optimal for the control problem assuming perfect state information.

  • Slide 123/261

    STEADY STATE/IMPLEMENTATION ASPECTS

As N → ∞, the solution of the Riccati equation converges to a steady state and L_k → L.

If x_0, w_k, and v_k are Gaussian, E{x_k | I_k} is a linear function of I_k and is generated by a nice recursive algorithm, the Kalman filter.

The Kalman filter involves also a Riccati equation, so for N → ∞ and a stationary system, it also has a steady-state structure.

Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state. For non-Gaussian uncertainty, computing E{x_k | I_k} may be very difficult, so a suboptimal solution is typically used.

Most common suboptimal controller: Replace E{x_k | I_k} by the estimate produced by the Kalman filter (act as if x_0, w_k, and v_k are Gaussian).

It can be shown that this controller is optimal within the class of controllers that are linear functions of I_k.

  • Slide 124/261

    6.231 DYNAMIC PROGRAMMING

    LECTURE 12

    LECTURE OUTLINE

    DP for imperfect state info

Sufficient statistics

Conditional state distribution as a sufficient statistic

    Finite-state systems

    Examples

  • Slide 125/261

REVIEW: PROBLEM WITH IMPERFECT STATE INFO

Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k ≥ 1

I_k: information vector available at time k:

I_0 = z_0, I_k = (z_0, z_1, ..., z_k, u_0, u_1, ..., u_{k-1}), k ≥ 1

Optimization over policies π = {μ_0, μ_1, ..., μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.

Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} { g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) }

subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k), k ≥ 0,

z_0 = h_0(x_0, v_0), z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k), k ≥ 1


    DP ALGORITHM

DP algorithm:

$$J_k(I_k) = \min_{u_k \in U_k} E_{x_k, w_k, z_{k+1}} \Big\{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) \,\Big|\, I_k, u_k \Big\}$$

for $k = 0, 1, \ldots, N-2$, and for $k = N-1$,

$$J_{N-1}(I_{N-1}) = \min_{u_{N-1} \in U_{N-1}} E_{x_{N-1}, w_{N-1}} \Big\{ g_N\big(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})\big) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) \,\Big|\, I_{N-1}, u_{N-1} \Big\}$$

The optimal cost $J^*$ is given by

$$J^* = E_{z_0}\big\{ J_0(z_0) \big\}.$$


    SUFFICIENT STATISTICS

Suppose that we can find a function $S_k(I_k)$ such that the right-hand side of the DP algorithm can be written in terms of some function $H_k$ as

$$\min_{u_k \in U_k} H_k\big(S_k(I_k), u_k\big).$$

Such a function $S_k$ is called a sufficient statistic.
An optimal policy obtained by the preceding minimization can be written as

$$\mu_k^*(I_k) = \overline{\mu}_k\big(S_k(I_k)\big),$$

where $\overline{\mu}_k$ is an appropriate function.
Example of a sufficient statistic: $S_k(I_k) = I_k$
Another important sufficient statistic:

$$S_k(I_k) = P_{x_k \mid I_k}$$


DP ALGORITHM IN TERMS OF $P_{x_k \mid I_k}$

It turns out that $P_{x_k \mid I_k}$ is generated recursively by a dynamic system (estimator) of the form

$$P_{x_{k+1} \mid I_{k+1}} = \Phi_k\big(P_{x_k \mid I_k}, u_k, z_{k+1}\big)$$

for a suitable function $\Phi_k$
The DP algorithm can be written as

$$\overline{J}_k\big(P_{x_k \mid I_k}\big) = \min_{u_k \in U_k} E_{x_k, w_k, z_{k+1}} \Big\{ g_k(x_k, u_k, w_k) + \overline{J}_{k+1}\big(\Phi_k(P_{x_k \mid I_k}, u_k, z_{k+1})\big) \,\Big|\, I_k, u_k \Big\}$$

[Figure: block diagram. The system $x_{k+1} = f_k(x_k, u_k, w_k)$ with measurement $z_k = h_k(x_k, u_{k-1}, v_k)$ feeds an estimator that recursively produces $P_{x_k \mid I_k}$; an actuator $\overline{\mu}_k$ maps $P_{x_k \mid I_k}$ to the control $u_k$, with a delay feeding $z_k$ and $u_{k-1}$ back to the estimator.]


    EXAMPLE: A SEARCH PROBLEM

At each period, decide to search or not search a site that may contain a treasure.
If we search and a treasure is present, we find it with prob. $\beta$ and remove it from the site.
Treasure's worth: $V$. Cost of search: $C$
States: treasure present & treasure not present
Each search can be viewed as an observation of the state
Denote

$$p_k: \text{ prob. of treasure present at the start of time } k,$$

with $p_0$ given.
$p_k$ evolves at time $k$ according to the equation

$$p_{k+1} = \begin{cases} p_k & \text{if not search,} \\ 0 & \text{if search and find treasure,} \\ \dfrac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} & \text{if search and no treasure.} \end{cases}$$


    SEARCH PROBLEM (CONTINUED)

DP algorithm:

$$\overline{J}_k(p_k) = \max\left[ 0,\; -C + p_k \beta V + (1 - p_k \beta)\, \overline{J}_{k+1}\!\left( \frac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} \right) \right],$$

with $\overline{J}_N(p_N) = 0$.
Can be shown by induction that the functions $\overline{J}_k$ satisfy

$$\overline{J}_k(p_k) = 0, \quad \text{for all } p_k \leq \frac{C}{\beta V}$$

Furthermore, it is optimal to search at period $k$ if and only if

$$p_k \beta V \geq C$$

(expected reward from the next search $\geq$ the cost of the search)
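A minimal numerical check of this recursion on a discretized belief grid, with illustrative values $\beta = 0.5$, $V = 10$, $C = 1$ (the grid and interpolation are implementation conveniences, not part of the slides):

```python
def search_dp(N=10, beta=0.5, V=10.0, C=1.0, grid=1001):
    ps = [i / (grid - 1) for i in range(grid)]
    J = [0.0] * grid                                  # J_N(p) = 0
    for _ in range(N):
        Jnew = []
        for p in ps:
            # belief after an unsuccessful search
            pn = p * (1 - beta) / (p * (1 - beta) + 1 - p)
            x = pn * (grid - 1)
            i = min(int(x), grid - 2)
            Jn = J[i] + (x - i) * (J[i + 1] - J[i])   # interpolate J_{k+1}
            Jnew.append(max(0.0, -C + p * beta * V + (1 - p * beta) * Jn))
        J = Jnew
    return ps, J

ps, J = search_dp()
# The smallest p with J_0(p) > 0 should sit near C/(beta*V) = 0.2.
print(next(p for p, j in zip(ps, J) if j > 0))
```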


    FINITE-STATE SYSTEMS

Suppose the system is a finite-state Markov chain, with states $1, \ldots, n$.
Then the conditional probability distribution $P_{x_k \mid I_k}$ is the vector

$$\big( P(x_k = 1 \mid I_k), \ldots, P(x_k = n \mid I_k) \big)$$

The DP algorithm can be executed over the $n$-dimensional simplex (state space is not expanding with increasing $k$)
When the control and observation spaces are also finite sets, it turns out that the cost-to-go functions $\overline{J}_k$ in the DP algorithm are piecewise linear and concave (Exercise 5.7).
This is conceptually important and also (moderately) useful in practice.


    INSTRUCTION EXAMPLE

Teaching a student some item. Possible states are $L$: Item learned, or $\overline{L}$: Item not learned.
Possible decisions: $T$: Terminate the instruction, or $\overline{T}$: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
The test has two possible outcomes: $R$: Student gives a correct answer, or $\overline{R}$: Student gives an incorrect answer.
[Figure: transition diagram of the probabilistic structure, with transition probability $t$ (an unlearned student becomes learned) and answer probability $r$ (a still-unlearned student answers correctly); a learned student stays learned and answers correctly with probability 1.]
Cost of instruction is $I$ per period
Cost of terminating instruction: 0 if student has learned the item, and $C > 0$ if not.


    INSTRUCTION EXAMPLE III

Write the DP algorithm as

$$\overline{J}_k(p_k) = \min\big[ (1 - p_k) C,\; I + A_k(p_k) \big],$$

where

$$A_k(p_k) = P(z_{k+1} = R \mid I_k)\, \overline{J}_{k+1}\big(\Phi(p_k, R)\big) + P(z_{k+1} = \overline{R} \mid I_k)\, \overline{J}_{k+1}\big(\Phi(p_k, \overline{R})\big),$$

with $\Phi(p_k, \cdot)$ denoting the updated probability given the test outcome.
Can show by induction that the $A_k(p)$ are piecewise linear, concave, monotonically decreasing, with

$$A_{k-1}(p) \leq A_k(p) \leq A_{k+1}(p), \quad \text{for all } p \in [0, 1].$$

[Figure: the termination cost $(1-p)C$ plotted against the curves $I + A_{N-1}(p)$, $I + A_{N-2}(p)$, $I + A_{N-3}(p)$ over $p \in [0, 1]$; the intersections define thresholds $\alpha_{N-1}$, $\alpha_{N-2}$, $\alpha_{N-3}$, and the point $1 - I/C$ is marked on the $p$ axis.]


    6.231 DYNAMIC PROGRAMMING

    LECTURE 13

    LECTURE OUTLINE

    Suboptimal control

Certainty equivalent control
Implementations and approximations
Issues in adaptive control


    PRACTICAL DIFFICULTIES OF DP

The curse of modeling
The curse of dimensionality
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
Intractability of imperfect state information problems
There may be real-time solution constraints
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled $\Rightarrow$ need for on-line replanning


    CERTAINTY EQUIVALENT CONTROL (CEC)

Replace the stochastic problem with a deterministic problem.
At each time $k$, the uncertain quantities are fixed at some typical values.
Implementation for an imperfect info problem. At each time $k$:
(1) Compute a state estimate $\overline{x}_k(I_k)$ given the current information vector $I_k$.
(2) Fix the $w_i$, $i \geq k$, at some $\overline{w}_i(x_i, u_i)$. Solve the deterministic problem:

$$\text{minimize } g_N(x_N) + \sum_{i=k}^{N-1} g_i\big(x_i, u_i, \overline{w}_i(x_i, u_i)\big)$$

subject to $x_k = \overline{x}_k(I_k)$ and for $i \geq k$,

$$u_i \in U_i, \qquad x_{i+1} = f_i\big(x_i, u_i, \overline{w}_i(x_i, u_i)\big).$$

(3) Use as control the first element in the optimal control sequence found.
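A minimal CEC loop for a toy scalar problem, sketched below with the disturbance fixed at its typical value $\overline{w} = 0$ and step (2) solved by brute-force enumeration over a small control set; everything here (dynamics, cost, control set) is an illustrative stand-in, not from the slides.

```python
import itertools, random

def cec_control(x, horizon, controls=(-2, -1, 0, 1, 2)):
    """Step (2): solve the deterministic problem with w fixed at 0;
    step (3): return the first control of the best sequence."""
    best_cost, best_u0 = float("inf"), 0
    for seq in itertools.product(controls, repeat=horizon):
        state, cost = x, 0.0
        for u in seq:
            cost += state**2 + u**2
            state = state + u              # dynamics with w = 0
        cost += state**2                   # terminal cost
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

random.seed(0)
x, N = 4.0, 8
for k in range(N):
    u = cec_control(x, horizon=min(4, N - k))
    x = x + u + random.gauss(0, 0.5)       # the true stochastic system
print(round(x, 3))
```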


    ALTERNATIVE IMPLEMENTATION

Let $\big\{ \mu_0^d(x_0), \ldots, \mu_{N-1}^d(x_{N-1}) \big\}$ be an optimal controller obtained from the DP algorithm for the deterministic problem

$$\text{minimize } g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), \overline{w}_k(x_k, \mu_k(x_k))\big)$$

$$\text{subject to } x_{k+1} = f_k\big(x_k, \mu_k(x_k), \overline{w}_k(x_k, \mu_k(x_k))\big), \qquad \mu_k(x_k) \in U_k$$

The CEC applies at time $k$ the control input

$$\overline{\mu}_k(I_k) = \mu_k^d\big(\overline{x}_k(I_k)\big)$$

[Figure: block diagram. The system $x_{k+1} = f_k(x_k, u_k, w_k)$ with measurement $z_k = h_k(x_k, u_{k-1}, v_k)$ feeds an estimator producing $\overline{x}_k(I_k)$; the actuator applies $u_k = \mu_k^d\big(\overline{x}_k(I_k)\big)$, with a delay feeding $z_k$ and $u_{k-1}$ back.]


    PARTIALLY STOCHASTIC CEC

Instead of fixing all future disturbances to their typical values, fix only some, and treat the rest as stochastic.
Important special case: Treat an imperfect state information problem as one of perfect state information, using an estimate $\overline{x}_k(I_k)$ of $x_k$ as if it were exact.
Multiaccess Communication Example: Consider controlling the slotted Aloha system (discussed in Ch. 5) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.
Natural partially stochastic CEC:

$$\tilde{\mu}_k(I_k) = \min\left[ 1, \frac{1}{\overline{x}_k(I_k)} \right],$$

where $\overline{x}_k(I_k)$ is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is $I_k$).


    THE PROBLEM OF IDENTIFIABILITY

Suppose we consider two phases:
A parameter identification phase (compute an estimate $\hat{\theta}$ of the unknown parameter $\theta$)
A control phase (apply the control that would be optimal if $\hat{\theta}$ were true).
A fundamental difficulty: the control process may make some of the unknown parameters invisible to the identification process.

Example: Consider the scalar system

$$x_{k+1} = a x_k + b u_k + w_k, \quad k = 0, 1, \ldots, N-1,$$

with the cost $E\big\{ \sum_{k=1}^{N} (x_k)^2 \big\}$. If $a$ and $b$ are known, the optimal control law is

$$\mu_k^*(x_k) = -\frac{a}{b}\, x_k.$$

If $a$ and $b$ are not known and we try to estimate them while applying some nominal control law $\mu_k(x_k) = \gamma x_k$, the closed-loop system is

$$x_{k+1} = (a + b\gamma) x_k + w_k,$$

so identification can at best find $(a + b\gamma)$ but not the values of both $a$ and $b$.
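A minimal simulation of this example (not from the slides): a least-squares fit of the closed-loop gain from $(x_k, x_{k+1})$ data recovers only $a + b\gamma$, so two different $(a, b)$ pairs with the same $a + b\gamma$ are indistinguishable. Numbers are illustrative.

```python
import random

def closed_loop_gain_ls(a, b, gamma, N=20000, seed=1):
    """Least-squares estimate of c in x_{k+1} = c x_k + w_k,
    with data generated under the nominal law u_k = gamma * x_k."""
    rng = random.Random(seed)
    x, num, den = 1.0, 0.0, 0.0
    for _ in range(N):
        x_next = a * x + b * (gamma * x) + rng.gauss(0, 1)
        num += x * x_next
        den += x * x
        x = x_next
    return num / den

gamma = 0.3
print(closed_loop_gain_ls(a=0.5, b=1.0, gamma=gamma))  # about a + b*gamma = 0.8
print(closed_loop_gain_ls(a=0.8, b=0.0, gamma=gamma))  # about 0.8 as well
```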


    CEC AND IDENTIFIABILITY I

Suppose we have $P\{x_{k+1} \mid x_k, u_k, \theta\}$ and we use a control law that is optimal for known $\theta$:

$$\mu_k(I_k) = \mu_k^*(x_k, \hat{\theta}_k), \quad \text{with } \hat{\theta}_k: \text{ estimate of } \theta$$

There are three systems of interest:
(a) The system (perhaps falsely) believed by the controller to be true, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \hat{\theta}_k \big\}.$$

(b) The true closed-loop system, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \theta \big\}.$$

(c) The optimal closed-loop system that corresponds to the true value of the parameter, which evolves probabilistically according to

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \theta), \theta \big\}.$$


    CEC AND IDENTIFIABILITY II

[Figure: three boxes. System believed to be true: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \hat{\theta}_k\}$; true closed-loop system: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}_k), \theta\}$; optimal closed-loop system: $P\{x_{k+1} \mid x_k, \mu^*(x_k, \theta), \theta\}$.]

Difficulty: There is a built-in mechanism for the parameter estimates to converge to a wrong value.
Assume that for some $\hat{\theta} \neq \theta$ and all $x_{k+1}$, $x_k$,

$$P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}), \hat{\theta} \big\} = P\big\{ x_{k+1} \mid x_k, \mu^*(x_k, \hat{\theta}), \theta \big\},$$

i.e., there is a false value of the parameter for which the system under closed-loop control looks exactly as if the false value were true.
Then, if the controller estimates the parameter to be $\hat{\theta}$ at some time, subsequent data will tend to reinforce this erroneous estimate.


    REMEDY TO IDENTIFIABILITY PROBLEM

Introduce noise in the applied control, i.e., occasionally deviate from the CEC actions.
This provides a means to escape from wrong estimates.
However, introducing noise in the control may be difficult to implement in practice.
Under some special circumstances, namely the self-tuning control context discussed in the book, the CEC is optimal in the limit, even if the parameter estimates converge to the wrong values.
All of this touches upon some of the most sophisticated aspects of adaptive control.


    6.231 DYNAMIC PROGRAMMING

    LECTURE 14

    LECTURE OUTLINE

    Limited lookahead policies

Performance bounds
Computational aspects
Problem approximation approach

    Vehicle routing example

Heuristic cost-to-go approximation
Computer chess


    LIMITED LOOKAHEAD POLICIES

One-step lookahead (1SL) policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},$$

where
$\tilde{J}_N = g_N$
$\tilde{J}_{k+1}$: approximation to the true cost-to-go $J_{k+1}$
Two-step lookahead policy: At each $k$ and $x_k$, use the control $\tilde{\mu}_k(x_k)$ attaining the minimum above, where the function $\tilde{J}_{k+1}$ is obtained using a 1SL approximation (solve a 2-step DP problem).
If $\tilde{J}_{k+1}$ is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.
Sometimes one also replaces $U_k(x_k)$ above with a subset of most promising controls $\overline{U}_k(x_k)$.
As the length of lookahead increases, the required computation quickly explodes.
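A minimal generic 1SL sketch for a problem with finitely many controls and disturbance values; the toy dynamics, cost, and approximation $\tilde{J}$ below are illustrative, not from the slides.

```python
def one_step_lookahead(x, k, controls, g, f, w_dist, J_tilde):
    """Return argmin_u E{ g(x,u,w) + J_tilde(k+1, f(x,u,w)) }."""
    def q(u):
        return sum(p * (g(x, u, w) + J_tilde(k + 1, f(x, u, w)))
                   for w, p in w_dist)
    return min(controls, key=q)

# Toy example: x_{k+1} = x + u + w, stage cost |x| + u^2, J_tilde = |x|.
u = one_step_lookahead(
    x=3, k=0, controls=[-2, -1, 0, 1, 2],
    g=lambda x, u, w: abs(x) + u**2,
    f=lambda x, u, w: x + u + w,
    w_dist=[(-1, 0.5), (1, 0.5)],
    J_tilde=lambda k, x: abs(x),
)
print(u)  # a control that moves the state toward 0 at modest control cost
```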


    PERFORMANCE BOUNDS

Let $\overline{J}_k(x_k)$ be the cost-to-go from $(x_k, k)$ of the 1SL policy, based on functions $\tilde{J}_k$.
Assume that for all $(x_k, k)$, we have

$$\hat{J}_k(x_k) \leq \tilde{J}_k(x_k), \qquad (*)$$

where $\hat{J}_N = g_N$ and for all $k$,

$$\hat{J}_k(x_k) = \min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}$$

[so $\hat{J}_k(x_k)$ is computed along with $\overline{\mu}_k(x_k)$]. Then

$$\overline{J}_k(x_k) \leq \tilde{J}_k(x_k), \quad \text{for all } (x_k, k).$$

Important application: When $\tilde{J}_k$ is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).
The bound can be extended to the case where there is a $\delta_k$ in the RHS of (*). Then

$$\overline{J}_k(x_k) \leq \tilde{J}_k(x_k) + \delta_k + \cdots + \delta_{N-1}$$


    COMPUTATIONAL ASPECTS

Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when $U_k(x_k)$ is not a discrete set]. Connection with the methodology of stochastic programming.
The choice of the approximating functions $\tilde{J}_k$ is critical; they can be calculated with a variety of methods.

Some approaches:
(a) Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem
(b) Heuristic Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)
(c) Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation


    PROBLEM APPROXIMATION

Many (problem-dependent) possibilities:
Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation
Simplify difficult constraints or dynamics
Example of enforced decomposition: Route $m$ vehicles that move over a graph. Each node has a value. The first vehicle that passes through the node collects its value. Maximize the total collected value, subject to initial and final time constraints (plus time windows and other constraints).
Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single-vehicle problems.
1SL scheme: At time $k$ and state $x_k$ (position of vehicles and set of collected-value nodes), consider all possible $k$th moves by the vehicles, and at the resulting states approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time.


    HEURISTIC COST-TO-GO APPROXIMATION

Use a cost-to-go approximation from a parametric class $\tilde{J}(x, r)$, where $x$ is the current state and $r = (r_1, \ldots, r_m)$ is a vector of tunable scalars (weights).
By adjusting the weights, one can change the shape of the approximation $\tilde{J}$ so that it is reasonably close to the true optimal cost-to-go function.
Two key issues:
The choice of the parametric class $\tilde{J}(x, r)$ (the approximation architecture).
The method for tuning the weights (training the architecture).
Successful application strongly depends on how these issues are handled, and on insight about the problem.
Sometimes a simulator is used, particularly when there is no mathematical model of the system.


    APPROXIMATION ARCHITECTURES

Divided into linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde{J}(x, r)$ on $r$].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
Architectures based on feature extraction:
[Figure: the state $x$ passes through a feature extraction mapping to a feature vector $y$, which feeds a cost approximator with parameter vector $r$, producing the cost approximation $\tilde{J}(y, r)$.]

Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go function being approximated, and the approximation may be quite accurate without a complicated architecture.
Sometimes the state space is partitioned, and local features are introduced for each subset of the partition (they are 0 outside the subset).
With a well-chosen feature vector $y(x)$, we can use a linear architecture

$$\tilde{J}(x, r) = \hat{J}\big(y(x), r\big) = \sum_i r_i y_i(x)$$
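A minimal sketch of such a linear architecture, with the weights $r$ tuned by least squares on sampled (state, target-cost) pairs; the features and data are illustrative.

```python
import numpy as np

def features(x):
    return np.array([1.0, x, x * x])       # y(x): constant, linear, quadratic

xs = np.linspace(-2, 2, 41)                # sampled states
rng = np.random.default_rng(0)
targets = xs**2 + 0.1 * rng.standard_normal(xs.size)  # noisy cost targets

Y = np.stack([features(x) for x in xs])    # rows: feature vectors y(x)
r, *_ = np.linalg.lstsq(Y, targets, rcond=None)       # tune the weights

J_tilde = lambda x: features(x) @ r        # J(x, r) = sum_i r_i y_i(x)
print(J_tilde(1.5))                        # roughly 2.25
```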


    COMPUTER CHESS I

Programs use a feature-based position evaluator that assigns a score to each move/position
[Figure: a position evaluator consisting of feature extraction (features such as material balance, mobility, safety, etc.) followed by a weighting of the features that produces a score.]

Most often the weighting of features is linear, but multistep lookahead is involved.
Most often the training is done by trial and error.
Additional techniques:
Depth-first search
Variable depth search when dynamic positions are involved
Alpha-beta pruning


    COMPUTER CHESS II

Multistep lookahead tree:
[Figure: a lookahead tree from position P (White to move) with alternating White/Black levels and candidate moves M1, M2; leaf evaluator scores (e.g., +8, +20, +18, +16, +24, ...) are backed up to internal nodes (values such as (+16), (+20), (+11)), and several branches are marked "Cutoff" by alpha-beta pruning.]

Alpha-beta pruning: As the move scores are evaluated by depth-first search, branches whose consideration (based on the calculations so far) cannot possibly change the optimal move are neglected.
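A minimal alpha-beta sketch over an explicit game tree (nested lists: internal nodes are lists of children, leaves are evaluator scores); purely illustrative, not the slides' chess program.

```python
def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):             # leaf: evaluator score
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if beta <= alpha:                  # cutoff: cannot change the result
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if beta <= alpha:                      # cutoff
            break
    return value

tree = [[[8, 20], [18, 16]], [[24, 20], [10, 12]]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # 18
```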


    6.231 DYNAMIC PROGRAMMING

    LECTURE 15

    LECTURE OUTLINE

    Rollout algorithms

Cost improvement property
Discrete deterministic problems
Sequential consistency and greedy algorithms

    Sequential improvement


    ROLLOUT ALGORITHMS

One-step lookahead policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},$$

where $\tilde{J}_N = g_N$ and $\tilde{J}_{k+1}$: approximation to the true cost-to-go $J_{k+1}$
Rollout algorithm: When $\tilde{J}_k$ is the cost-to-go of some heuristic policy (called the base policy)
Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.
Main difficulty: Calculating $\tilde{J}_k(x_k)$ may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case.


    EXAMPLE: THE QUIZ PROBLEM

A person is given $N$ questions; answering correctly question $i$ has probability $p_i$, with reward $v_i$.
Quiz terminates at the first incorrect answer.
Problem: Choose the ordering of questions so as to maximize the total expected reward.
Assuming no other constraints, it is optimal to use the index policy: Questions should be answered in decreasing order of the index of preference $p_i v_i / (1 - p_i)$.
With minor changes in the problem, the index policy need not be optimal. Examples:
A limit ($< N$) on the maximum number of questions that can be answered.
Time windows, sequence-dependent rewards, precedence constraints.
Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.
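A minimal sketch of the index policy and of the expected reward it achieves from a set of remaining questions, which is exactly the quantity a rollout scheme needs at each state; the $(p_i, v_i)$ data are illustrative.

```python
def index_order(questions):
    """Sort by the index p_i v_i / (1 - p_i), decreasing."""
    return sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]), reverse=True)

def expected_reward(ordered):
    """The quiz stops at the first incorrect answer."""
    total, alive = 0.0, 1.0
    for p, v in ordered:
        total += alive * p * v   # reward v collected if still alive and correct
        alive *= p
    return total

questions = [(0.9, 1.0), (0.5, 4.0), (0.2, 20.0)]   # (p_i, v_i) pairs
print(expected_reward(index_order(questions)))       # 4.86
```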


    COST IMPROVEMENT PROPERTY

Let
$\overline{J}_k(x_k)$: cost-to-go of the rollout policy
$H_k(x_k)$: cost-to-go of the base policy
We claim that $\overline{J}_k(x_k) \leq H_k(x_k)$ for all $x_k$ and $k$.
Proof by induction: We have $\overline{J}_N(x_N) = H_N(x_N)$ for all $x_N$. Assume that

$$\overline{J}_{k+1}(x_{k+1}) \leq H_{k+1}(x_{k+1}), \quad \forall\, x_{k+1}.$$

Then, for all $x_k$,

$$\begin{aligned} \overline{J}_k(x_k) &= E\Big\{ g_k\big(x_k, \overline{\mu}_k(x_k), w_k\big) + \overline{J}_{k+1}\big(f_k(x_k, \overline{\mu}_k(x_k), w_k)\big) \Big\} \\ &\leq E\Big\{ g_k\big(x_k, \overline{\mu}_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \overline{\mu}_k(x_k), w_k)\big) \Big\} \\ &\leq E\Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\ &= H_k(x_k) \end{aligned}$$

[the first inequality uses the induction hypothesis; the second holds because the rollout control $\overline{\mu}_k(x_k)$ minimizes the expression over all controls, including the base policy control $\mu_k(x_k)$]


    EXAMPLE: THE BREAKTHROUGH PROBLEM


Given a binary tree with $N$ stages.
Each arc is either free or is blocked (crossed out in the figure).
Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).
Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.
For large $N$ and a given probability of a free branch: the rollout algorithm requires $O(N)$ times more computation, but has $O(N)$ times larger probability of finding a free path than the greedy algorithm.
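A minimal simulation comparing the greedy base heuristic with its rollout on random instances (each arc free with probability $q$); the $O(N)$ analysis above is the slides', while the sketch below just measures success rates empirically with illustrative parameters.

```python
import random

def make_tree(N, q, rng):
    """Level d has 2**(d+1) arcs into level d+1; each free with prob q."""
    return [[rng.random() < q for _ in range(2 ** (d + 1))] for d in range(N)]

def greedy_from(tree, d, node):
    """Base heuristic from (d, node): right branch if free, else left."""
    for dd in range(d, len(tree)):
        right, left = 2 * node + 1, 2 * node
        if tree[dd][right]:   node = right
        elif tree[dd][left]:  node = left
        else:                 return False
    return True

def rollout(tree):
    node = 0
    for d in range(len(tree)):
        kids = [c for c in (2 * node + 1, 2 * node) if tree[d][c]]
        if not kids:
            return False
        # prefer a child whose greedy projection reaches a leaf
        node = next((c for c in kids if greedy_from(tree, d + 1, c)), kids[0])
    return True

rng = random.Random(0)
trials = [make_tree(10, 0.6, rng) for _ in range(2000)]
print(sum(greedy_from(t, 0, 0) for t in trials) / 2000,
      sum(rollout(t) for t in trials) / 2000)
```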


    DISCRETE DETERMINISTIC PROBLEMS

Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented as a sequential decision process by using a tree.
The leaves of the tree correspond to the feasible solutions.
The problem can be solved by DP, starting from the leaves and going back towards the root.
Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of $N$ cities.

[Figure: decision tree for the traveling salesman problem with four cities A, B, C, D. From origin node $s = A$, the partial tours AB, AC, AD branch into ABC, ABD, ACB, ACD, ADB, ADC, which lead to the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB.]


    A CLASS OF GENERAL DISCRETE PROBLEMS

Generic problem:
Given a graph with directed arcs
A special node $s$ called the origin
A set of terminal nodes, called destinations, and a cost $g(i)$ for each destination $i$.
Find a min cost path starting at the origin, ending at one of the destination nodes.
Base heuristic: For any nondestination node $i$, constructs a path $(i, i_1, \ldots, i_m, \overline{i})$ starting at $i$ and ending at one of the destination nodes $\overline{i}$. We call $\overline{i}$ the projection of $i$, and we denote $H(i) = g(\overline{i})$.

    s i1 im

    j1

    j2

    j3

    j4

    p(j1)

    p(j2)

    p(j3)

    p(j4)

    im-1

    Neighbors of imProjections of

    Neighbors of im


    EXAMPLE: ONE-DIMENSIONAL WALK

A person takes either a unit step to the left or a unit step to the right. Minimize the cost $g(i)$ of the point $i$ where he will end up after $N$ steps.
[Figure: the cost $g(i)$ plotted over the final positions $i \in \{-N, \ldots, N\}$, together with the triangle of trajectories from $(0, 0)$ to $(N, -N), \ldots, (N, N)$; the rightmost local minimum and the global minimum of $g$ are marked.]

Base heuristic: Always go to the right. Rollout finds the rightmost local minimum.
Base heuristic: Compare "always go to the right" and "always go to the left." Choose the best of the two. Rollout finds a global minimum.


    SEQUENTIAL CONSISTENCY

The base heuristic is sequentially consistent if, for every node $i$, whenever it generates the path $(i, i_1, \ldots, i_m, \overline{i})$ starting at $i$, it also generates the path $(i_1, \ldots, i_m, \overline{i})$ starting at the node $i_1$ (i.e., all nodes of its path have the same projection).
Prime example of a sequentially consistent heuristic is a greedy algorithm. It uses an estimate $F(i)$ of the optimal cost starting from $i$.
At the typical step, given a path $(i, i_1, \ldots, i_m)$, where $i_m$ is not a destination, the algorithm adds to the path a node $i_{m+1}$ such that

$$i_{m+1} = \arg\min_{j \in N(i_m)} F(j)$$

If the base heuristic is sequentially consistent, the cost of the rollout algorithm is no more than the cost of the base heuristic. In particular, if $(s, i_1, \ldots, i_{\overline{m}})$ is the rollout path, we have

$$H(s) \geq H(i_1) \geq \cdots \geq H(i_{\overline{m}-1}) \geq H(i_{\overline{m}}),$$

where $H(i) =$ cost of the heuristic starting from $i$.


    6.231 DYNAMIC PROGRAMMING

    LECTURE 16

    LECTURE OUTLINE

    More on rollout algorithms

Simulation-based methods
Approximations of rollout algorithms
Rolling horizon approximations

    Discretization issues

    Other suboptimal approaches


    ROLLOUT ALGORITHMS

Rollout policy: At each $k$ and state $x_k$, use the control $\overline{\mu}_k(x_k)$ that attains the minimum in

$$\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k),$$

where

$$Q_k(x_k, u_k) = E\Big\{ g_k(x_k, u_k, w_k) + H_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}$$

and $H_{k+1}(x_{k+1})$ is the cost-to-go of the heuristic.
$Q_k(x_k, u_k)$ is called the Q-factor of $(x_k, u_k)$, and for a stochastic problem, its computation may involve Monte Carlo simulation.
Potential difficulty: To minimize the Q-factor over $u_k$, we must form Q-factor differences $Q_k(x_k, u) - Q_k(x_k, \overline{u})$. This differencing often amplifies the simulation error in the calculation of the Q-factors.
Potential remedy: Compare any two controls $u$ and $\overline{u}$ by simulating the difference $Q_k(x_k, u) - Q_k(x_k, \overline{u})$ directly.


    ROLLING HORIZON APPROACH

This is an $l$-step lookahead policy where the cost-to-go approximation is just 0.
Alternatively, the cost-to-go approximation is the terminal cost function $g_N$.
A short rolling horizon saves computation.
Paradox: It is not true that a longer rolling horizon always improves performance.
Example: At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.

[Figure: from the current state, control 1 leads to the optimal trajectory, which shows high cost over the first $l$ stages; control 2 leads to a trajectory with low cost over the first $l$ stages but high cost thereafter, so the $l$-stage rolling horizon prefers the inferior control 2.]


    ROLLING HORIZON COMBINED WITH ROLLOUT

We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.
Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.
Example: $N$-stage stopping problem where the stopping cost is 0, the continuation cost is either $-\epsilon$ or 1, where $0 < \epsilon < 1/N$, and the first state with continuation cost equal to 1 is state $m$. Then the optimal policy is to stop at state $m$, and the optimal cost is $-m\epsilon$.
[Figure: states $0, 1, 2, \ldots, m, \ldots, N$ in a line, with a transition to the stopped state available at each state; the continuation cost is $-\epsilon$ before state $m$ and 1 at state $m$.]
Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of $l \leq m$ steps.
It will continue up to the first $m - l + 1$ stages, thus accumulating a cost of $-(m - l + 1)\epsilon$. The rollout performance improves as $l$ becomes shorter!


    GENERAL APPROACH FOR DISCRETIZATION I

Given a discrete-time system with state space $S$, consider a finite subset $\overline{S}$; for example $\overline{S}$ could be a finite grid within a continuous state space $S$. Assume stationarity for convenience, i.e., that the system equation and cost per stage are the same for all times.
We define an approximation to the original problem, with state space $\overline{S}$, as follows:
Express each $x \in S$ as a convex combination of states in $\overline{S}$, i.e.,

$$x = \sum_{x_i \in \overline{S}} \phi_i(x)\, x_i, \quad \text{where } \phi_i(x) \geq 0, \quad \sum_i \phi_i(x) = 1$$

Define a reduced dynamic system with state space $\overline{S}$, whereby from each $x_i \in \overline{S}$ we move to $x = f(x_i, u, w)$ according to the system equation of the original problem, and then move to $x_j \in \overline{S}$ with probabilities $\phi_j(x)$.
Define similarly the corresponding cost per stage of the transitions of the reduced system.
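A minimal sketch of this construction for a one-dimensional state space with a uniform grid: the weights $\phi_i(x)$ come from linear interpolation between the two grid points enclosing $x$ (names and numbers are illustrative).

```python
import numpy as np

def weights(x, grid):
    """Express x as a convex combination of the two enclosing grid points."""
    j = np.clip(np.searchsorted(grid, x) - 1, 0, len(grid) - 2)
    lam = (grid[j + 1] - x) / (grid[j + 1] - grid[j])
    phi = np.zeros(len(grid))
    phi[j], phi[j + 1] = lam, 1 - lam
    return phi                        # phi_i(x) >= 0, sum_i phi_i(x) = 1

grid = np.linspace(-1.0, 1.0, 11)

def reduced_transition(i, u, w, f):
    """From grid point x_i, move to x = f(x_i, u, w) in the original system,
    then to grid point x_j with probability phi_j(x)."""
    x = f(grid[i], u, w)
    return weights(np.clip(x, grid[0], grid[-1]), grid)

probs = reduced_transition(i=5, u=0.3, w=0.05, f=lambda x, u, w: 0.9 * x + u + w)
print(probs.nonzero()[0], probs[probs > 0])   # two neighboring grid points
```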


    GENERAL APPROACH FOR DISCRETIZATION II