
Stochastic Optimal Control

Marco Pavone

Stanford University

April 29, 2015


AA 241X Mission

Mission: “A wild fire is occurring in Lake Lagunita and AA241X Teams have been contracted to minimize the damage. Teams have to design, build and fly a UAV that can detect, prevent and extinguish the fire, with the goal of minimizing the area on fire in a fixed amount of time. Multiple fires can be present at the start of the mission and as time goes by the fire propagates through Lake Lagunita.”

• A difficult problem: it combines exploration and exploitation

• Goal: to provide you with fundamental knowledge in the field of stochastic optimal control (focus on exploitation)

• Approach: dynamic programming


Basic SOC Problem

• System: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, \ldots, N-1$

• Control constraints: $u_k \in U_k(x_k)$

• Probability distribution: $P_k(\cdot \mid x_k, u_k)$ of $w_k$

• Policies: $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, where $u_k = \mu_k(x_k)$

• Expected cost:
$$J_\pi(x_0) = \mathbb{E}\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right\}$$

• Stochastic optimal control problem:
$$J^*(x_0) = \min_\pi J_\pi(x_0)$$
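For concreteness, the expected cost of a fixed policy can be estimated by simulating the system forward. The sketch below is a minimal Monte Carlo estimator of $J_\pi(x_0)$; the interfaces (`f`, `g`, `g_N`, `policy`, `sample_w`) are hypothetical placeholders for the problem data above, not part of the original slides.

```python
import numpy as np

def estimate_policy_cost(f, g, g_N, policy, sample_w, x0, N,
                         num_rollouts=10_000, seed=0):
    """Monte Carlo estimate of J_pi(x0) for the finite-horizon problem above.

    Hypothetical interfaces (placeholders, not from the slides):
      f(k, x, u, w)          -> next state x_{k+1}
      g(k, x, u, w)          -> stage cost g_k(x_k, u_k, w_k)
      g_N(x)                 -> terminal cost g_N(x_N)
      policy[k](x)           -> control u_k = mu_k(x_k)
      sample_w(k, x, u, rng) -> sample of w_k from P_k(. | x_k, u_k)
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for k in range(N):
            u = policy[k](x)
            w = sample_w(k, x, u, rng)
            cost += g(k, x, u, w)
            x = f(k, x, u, w)
        total += cost + g_N(x)
    return total / num_rollouts
```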


Key points

• Discrete-time model

• Markovian model

• Objective: find optimal closed-loop policy

• Additive cost (central assumption)

• Risk-neutral formulation

Other communities use different notation:

• Powell, W. B. AI, OR and control theory: A rosetta stone for stochastic optimization. Princeton University, 2012.
  http://castlelab.princeton.edu/Papers/AIOR_July2012.pdf


Principle of Optimality

• Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be an optimal policy

• Consider the tail subproblem of minimizing, starting from state $x_i$ at time $i$,
$$\mathbb{E}\left\{ g_N(x_N) + \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right\}$$
and the tail policy $\{\mu_i^*, \ldots, \mu_{N-1}^*\}$

• Principle of optimality: the tail policy is optimal for the tail subproblem


The DP Algorithm

Intuition:

• DP first solves ALL tail subproblems at the final stage

• At a generic step, it solves ALL tail subproblems of a given time length, using the solution of tail subproblems of shorter length

The DP algorithm:

• Start with
$$J_N(x_N) = g_N(x_N),$$
and go backwards using
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}_{w_k}\left\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\},$$
for $k = 0, 1, \ldots, N-1$

• Then $J^*(x_0) = J_0(x_0)$, and an optimal policy is constructed by setting $\mu_k^*(x_k) = u_k^*$, where $u_k^*$ attains the minimum above.
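For finite state, control, and disturbance spaces, the backward recursion above can be implemented directly. The following is a minimal sketch under assumed (hypothetical) interfaces for the problem data; it is illustrative, not the canonical implementation from the course.

```python
def backward_dp(states, controls, f, g, g_N, P_w, N):
    """Exact backward DP for finite state/control/disturbance sets.

    Hypothetical interfaces (placeholders, not from the slides):
      controls(k, x) -> iterable of admissible controls in U_k(x)
      f(k, x, u, w)  -> next state
      g(k, x, u, w)  -> stage cost
      g_N(x)         -> terminal cost
      P_w(k, x, u)   -> dict {w: probability} describing P_k(. | x, u)
    Returns the cost-to-go tables J[k][x] and a policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                       # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):             # backward in time
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in P_w(k, x, u).items())
                if q < best_cost:
                    best_cost, best_u = q, u
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu
```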


Example: Inventory Control Problem (1/2)

• Stock available $x_k \in \mathbb{N}$, inventory ordered $u_k \in \mathbb{N}$, and demand $w_k \in \mathbb{N}$

• Dynamics: $x_{k+1} = \max(0, x_k + u_k - w_k)$

• Constraints: $x_k + u_k \le 2$

• Probabilistic structure: $p(w_k = 0) = 0.1$, $p(w_k = 1) = 0.7$, and $p(w_k = 2) = 0.2$

• Cost:
$$\mathbb{E}\left\{ \underbrace{0}_{g_3(x_3)} + \sum_{k=0}^{2} \underbrace{\big(u_k + (x_k + u_k - w_k)^2\big)}_{g_k(x_k, u_k, w_k)} \right\}$$


Example: Inventory Control Problem (2/2)

• The DP algorithm takes the form
$$J_k(x_k) = \min_{0 \le u_k \le 2 - x_k} \mathbb{E}_{w_k}\left\{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}\big(\max(0, x_k + u_k - w_k)\big) \right\},$$
for $k = 0, 1, 2$, with $J_3(x_3) = 0$

• For example,
$$J_2(0) = \min_{u_2 \in \{0,1,2\}} \mathbb{E}_{w_2}\left\{ u_2 + (u_2 - w_2)^2 \right\} = \min_{u_2 \in \{0,1,2\}}\left[ u_2 + 0.1\, u_2^2 + 0.7\,(u_2 - 1)^2 + 0.2\,(u_2 - 2)^2 \right],$$
which yields $J_2(0) = 1.3$ and $\mu_2^*(0) = 1$

• Final solution: $J_0(0) = 3.7$, $J_0(1) = 2.7$, and $J_0(2) = 2.818$
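The numbers above can be checked with a short script implementing the backward recursion for this specific example (only problem data from the slide is used):

```python
# Backward DP for the inventory example; reproduces J_2(0) = 1.3,
# J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818 (up to floating-point rounding).
P = {0: 0.1, 1: 0.7, 2: 0.2}           # demand distribution p(w_k)
states = [0, 1, 2]
N = 3

J = {x: 0.0 for x in states}           # terminal cost g_3(x_3) = 0
policy = []
for k in range(N - 1, -1, -1):         # backward recursion
    J_new, mu = {}, {}
    for x in states:
        best_u, best_cost = None, float("inf")
        for u in range(0, 2 - x + 1):  # constraint x_k + u_k <= 2
            cost = sum(p * (u + (x + u - w) ** 2 + J[max(0, x + u - w)])
                       for w, p in P.items())
            if cost < best_cost:
                best_cost, best_u = cost, u
        J_new[x], mu[x] = best_cost, best_u
    J, policy = J_new, [mu] + policy

print(J)       # approximately {0: 3.7, 1: 2.7, 2: 2.818}
print(policy)  # optimal orders mu_k(x_k) for k = 0, 1, 2
```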


Difficulties of DP

• Curse of dimensionality:
  • Exponential growth of the computational and storage requirements
  • Intractability of imperfect state information problems

• Curse of modeling: if the “system stochastics” are complex, it is difficult to obtain expressions for the transition probabilities

• Curse of time:
  • The data of the problem to be solved is given with little advance notice
  • The problem data may change as the system is controlled, creating a need for on-line replanning


Solution: Approximate DP

• Certainty Equivalent Control

• Cost-to-Go Approximation

• Other Approaches (e.g., approximation in policy space)


Certainty Equivalent Control

• Idea: replace the stochastic problem with a deterministic one

• At each time $k$, the future uncertain quantities are fixed at some “typical” values

• Online implementation:

  1. Fix the $w_i$, $i \ge k$, at some nominal values $\bar{w}_i$ and solve the deterministic problem
  $$\min \; g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i, \bar{w}_i), \qquad \text{where } x_{i+1} = f_i(x_i, u_i, \bar{w}_i)$$
  2. Use as control $\bar{\mu}_k(x_k)$ the first element of the optimal control sequence and move to step $k+1$

• Extends to the imperfect state information case (use a state estimate $\bar{x}_k(I_k)$)
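A minimal sketch of the CEC step above, assuming a small finite problem where the deterministic tail problem can simply be brute-forced over control sequences; the interfaces (`controls`, `f`, `g`, `g_N`, `w_nominal`) are hypothetical placeholders:

```python
def cec_control(k, x_k, N, controls, f, g, g_N, w_nominal):
    """Certainty-equivalent control at stage k: freeze w_i at nominal values
    w_nominal[i] and brute-force the deterministic tail problem over control
    sequences (only viable for small, finite problems).

    Hypothetical interfaces (placeholders): controls(i, x), f(i, x, u, w),
    g(i, x, u, w), g_N(x), and w_nominal[i] giving the nominal disturbance.
    Returns the first control of the best open-loop sequence.
    """
    best = {"cost": float("inf"), "u0": None}

    def search(i, x, cost, first_u):
        if i == N:
            if cost + g_N(x) < best["cost"]:
                best["cost"], best["u0"] = cost + g_N(x), first_u
            return
        for u in controls(i, x):
            w = w_nominal[i]
            search(i + 1, f(i, x, u, w), cost + g(i, x, u, w),
                   u if first_u is None else first_u)

    search(k, x_k, 0.0, None)
    return best["u0"]
```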


Cost-to-Go Approximation (CGA)

• Idea: truncate the time horizon and approximate the cost-to-go

• One-step lookahead policy: at each $k$ and state $x_k$, use the control $\bar{\mu}_k(x_k)$ that attains
$$\min_{u_k \in U_k(x_k)} \mathbb{E}\left\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\},$$
where

  • $\tilde{J}_N = g_N$

  • $\tilde{J}_{k+1}$ is an approximation to the true cost-to-go $J_{k+1}$

• Analogously, a two-step lookahead policy uses all of the above together with
$$\tilde{J}_{k+1}(x_{k+1}) = \min_{u_{k+1} \in U_{k+1}(x_{k+1})} \mathbb{E}\left\{ g_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) + \tilde{J}_{k+2}\big(f_{k+1}(x_{k+1}, u_{k+1}, w_{k+1})\big) \right\}$$
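The one-step lookahead controller is a small wrapper around the approximate cost-to-go. The sketch below assumes hypothetical interfaces (`controls`, `P_w`, `f`, `g`, `J_tilde`) and a finite disturbance distribution:

```python
def one_step_lookahead(k, x_k, controls, P_w, f, g, J_tilde):
    """One-step lookahead control using an approximate cost-to-go J_tilde.

    Hypothetical interfaces (placeholders): controls(k, x), P_w(k, x, u) ->
    dict {w: probability}, f(k, x, u, w), g(k, x, u, w), and
    J_tilde(k, x) approximating J_k(x), with J_tilde(N, x) = g_N(x).
    """
    def q(u):  # expected one-step cost plus approximate cost-to-go
        return sum(p * (g(k, x_k, u, w) + J_tilde(k + 1, f(k, x_k, u, w)))
                   for w, p in P_w(k, x_k, u).items())
    return min(controls(k, x_k), key=q)
```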


CGA—Computational Aspects

• If $\tilde{J}_{k+1}$ is readily available and the minimization is not too hard, this approach is implementable on-line

• The choice of the approximating functions $\tilde{J}_k$ is critical:

  1. Problem approximation: approximate by considering a simpler problem

  2. Parametric cost-to-go approximation: approximate the cost-to-go function with a function of suitable parametric form (parameters tuned by some scheme → neuro-dynamic programming)

  3. Rollout approach: approximate the cost-to-go with the cost of some suboptimal policy


CGA—Problem Approximation

• Many problem-dependent possibilities:
  • Replace uncertain quantities by nominal values (in the spirit of CEC)
  • Simplify difficult constraints or dynamics
  • Decouple subsystems
  • Aggregate states


CGA—Parametric Approximation

• Use a cost-to-go approximation from a parametric class $\tilde{J}(x, r)$, where $x$ is the current state and $r = (r_1, \ldots, r_m)$ is a vector of “tunable” weights

• Two key aspects:

  • Choice of the parametric class $\tilde{J}(x, r)$
    • Example: feature extraction method
    $$\tilde{J}(x, r) = \sum_{i=1}^{m} r_i\, y_i(x),$$
    where the $y_i$'s are features

  • Algorithm for tuning the weights (possibly simulation-based)
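One simple way to tune the weights $r$ is least-squares regression of sampled cost observations onto the features; the sketch below is illustrative and assumes a hypothetical `features(x)` returning the vector $(y_1(x), \ldots, y_m(x))$:

```python
import numpy as np

def fit_linear_cost_to_go(sampled_states, sampled_costs, features):
    """Fit J_tilde(x, r) = sum_i r_i * y_i(x) by least squares on sampled
    (state, observed cost) pairs -- one simple, simulation-based tuning scheme.

    Hypothetical interface (placeholder): features(x) returns the vector
    (y_1(x), ..., y_m(x)); sampled_costs are cost samples for sampled_states.
    """
    Y = np.array([np.asarray(features(x), dtype=float) for x in sampled_states])
    c = np.asarray(sampled_costs, dtype=float)
    r, *_ = np.linalg.lstsq(Y, c, rcond=None)             # tuned weights
    return lambda x: float(np.asarray(features(x)) @ r)   # J_tilde(x, r)
```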


CGA—Rollout Approach

• $\tilde{J}_k$ is the cost-to-go of some heuristic policy (called the base policy)

• To compute the rollout control, one needs, for all $u_k$, the Q-factors
$$Q_k(x_k, u_k) := \mathbb{E}\left\{ g_k(x_k, u_k, w_k) + H_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\},$$
where $H_{k+1}$ is the cost-to-go of the base policy

• Q-factors can be evaluated via Monte Carlo simulation

• Q-factors can be approximated, e.g., by using a CEC approach

• Model predictive control (MPC) can be viewed as a special case of rollout algorithms (AA 203)
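A sketch of the rollout control at stage $k$: estimate each Q-factor by Monte Carlo rollouts of a given base policy, then pick the minimizing control. The interfaces mirror the earlier sketches and are hypothetical placeholders:

```python
import numpy as np

def rollout_control(k, x_k, N, controls, f, g, g_N, sample_w, base_policy,
                    num_rollouts=200, seed=0):
    """Rollout at stage k: estimate Q_k(x_k, u_k) by Monte Carlo simulation of
    the base (heuristic) policy from the next state onward, then pick the
    minimizing control. Hypothetical interfaces (placeholders): controls(k, x),
    f(k, x, u, w), g(k, x, u, w), g_N(x), sample_w(k, x, u, rng),
    base_policy[i](x).
    """
    rng = np.random.default_rng(seed)

    def q_estimate(u):
        total = 0.0
        for _ in range(num_rollouts):
            w = sample_w(k, x_k, u, rng)
            cost = g(k, x_k, u, w)
            x = f(k, x_k, u, w)
            for i in range(k + 1, N):          # follow the base policy
                u_i = base_policy[i](x)
                w_i = sample_w(i, x, u_i, rng)
                cost += g(i, x, u_i, w_i)
                x = f(i, x, u_i, w_i)
            total += cost + g_N(x)
        return total / num_rollouts

    return min(controls(k, x_k), key=q_estimate)
```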


Other ADP Approaches

• Minimize the DP equation error

• Direct approximation of control policies

• Approximation in policy space


References

• Bertsekas, D. P. Dynamic Programming and Optimal Control, Volumes 1 & 2. Athena Scientific, 2005.

• Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

• Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
