Making complex decisions
Chapter 17
Outline
• Sequential decision problems (Markov decision processes)
  – Value iteration
  – Policy iteration
Sequential Decisions
• Agent’s utility depends on a sequence of decisions
Markov Decision Process (MDP)
• Defined as a tuple <S, A, M, R>
  – S: set of states
  – A: set of actions
  – M: transition function
    • Table M_ij(a) = P(s_j | s_i, a), the probability of reaching s_j given action a in state s_i
  – R: reward function
    • R(s_i, a) = cost or reward of taking action a in state s_i
    • In our case R = R(s_i)
• Choose a sequence of actions (not just one decision or one action)
  – Utility is based on the whole sequence of decisions
Generalization
Inputs:
• Initial state s0
• Action model
• Reward R(s_i) collected in each state s_i

A state is terminal if it has no successor. Starting at s0, the agent keeps executing actions until it reaches a terminal state. Its goal is to maximize the expected sum of rewards collected (additive rewards).

Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + …
Discounted rewards (we will not use this reward form):
U(s0, s1, s2, …) = R(s0) + γR(s1) + γ²R(s2) + …   (0 < γ < 1)
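The two utility definitions above can be checked numerically. A minimal sketch (the reward sequence and γ = 0.9 are illustrative assumptions, not values from the slides):

```python
# Additive vs. discounted utility of a reward sequence.
def additive_utility(rewards):
    """U = R(s0) + R(s1) + R(s2) + ..."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]   # assumed rewards R(s0), R(s1), R(s2)
assert additive_utility(rewards) == 3.0
assert abs(discounted_utility(rewards, 0.9) - 2.71) < 1e-9   # 1 + 0.9 + 0.81
```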
Dealing with infinite sequences of actions
• Use discounted rewards
• The environment contains terminal states that the agent is guaranteed to reach eventually
• Infinite sequences are compared by their average rewards
Example
• Fully observable environment
• Non-deterministic actions
  – intended effect: probability 0.8
  – unintended (perpendicular) effects: probability 0.1 each

[Figure: the 4×3 grid world. The agent starts at (1,1), cell (2,2) is blocked, and the terminal states are (4,3) = +1 and (4,2) = −1. Each action moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1.]
MDP of example
• S: state of the agent on the grid, e.g., (4,3)
  – Note that a cell is denoted by (x, y)
• A: actions of the agent, i.e., N, E, S, W
• M: transition function
  – E.g., M((4,2) | (3,2), N) = 0.1
  – E.g., M((3,3) | (3,2), N) = 0.8
  – (Models robot movement, uncertainty about another agent’s actions, …)
• R: reward (more comments on the reward function later)
  – R(1,1) = −1/25, R(1,2) = −1/25, …
  – R(4,3) = +1, R(4,2) = −1
• γ = 1
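The transition model M above can be written out explicitly. A minimal sketch (the grid layout and 0.8/0.1/0.1 motion model are taken from the slides; the function names are my own):

```python
# Transition model for the 4x3 grid world: 0.8 intended, 0.1 each perpendicular.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}   # perpendicular slips

def step(s, a):
    """Deterministic landing cell (bounce off the walls and the (2,2) obstacle)."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def M(s_next, s, a):
    """P(s_next | s, a): 0.8 intended direction, 0.1 each perpendicular slip."""
    p = 0.0
    if step(s, a) == s_next:
        p += 0.8
    for side in PERP[a]:
        if step(s, side) == s_next:
            p += 0.1
    return p

# Matches the examples on this slide:
assert M((4, 2), (3, 2), 'N') == 0.1
assert M((3, 3), (3, 2), 'N') == 0.8
```

As a sanity check, the probabilities leaving every state/action pair sum to one, since an action that hits a wall keeps the agent in place.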
Policy
• A policy π is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the series of actions the policy produces.
• The goal: find an optimal policy π*, one that produces maximal expected utility.

[Figure: an example policy, shown as an arrow in each non-terminal cell of the 4×3 grid.]
Utility of a State
Given a state s, we can measure the expected utility obtained by applying any policy π.

Assume the agent is in state s and define St (a random variable) as the state reached at step t. Obviously S0 = s.

The expected utility of state s under policy π is

U^π(s) = E[Σ_t γ^t R(St)]

π*_s will be a policy that maximizes the expected utility of state s.
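The expectation above can be approximated by simulating the policy and averaging discounted returns. A minimal Monte Carlo sketch on a hypothetical two-state chain (all numbers are invented so the exact answer is known in closed form):

```python
import random

# Under a fixed policy, state 'A' yields R = 1 and moves to the terminal state
# 'end' with probability 0.5, otherwise stays in 'A'; gamma = 0.9.
# Exact solution: U(A) = 1 + 0.9 * 0.5 * U(A)  =>  U(A) = 1/0.55 ~= 1.818.
GAMMA = 0.9

def episode_return(rng):
    """Discounted return of one simulated episode starting in 'A'."""
    g, discount, state = 0.0, 1.0, 'A'
    while state != 'end':
        g += discount * 1.0            # R(A) = 1
        discount *= GAMMA
        state = 'end' if rng.random() < 0.5 else 'A'
    return g

rng = random.Random(0)                 # fixed seed for reproducibility
estimate = sum(episode_return(rng) for _ in range(20000)) / 20000
assert abs(estimate - 1 / 0.55) < 0.05
```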
Utility of a State
The utility of a state s measures its desirability:
If S_i is terminal:
  U(S_i) = R(S_i)

If S_i is non-terminal:
  U(S_i) = R(S_i) + γ max_a Σ_j P(S_j | S_i, a) U(S_j)   [Bellman equation]

[the reward of s augmented by the expected sum of discounted rewards collected in future states]
The Bellman equations (one per state) can be solved by dynamic programming.
Optimal Policy
A policy π is a function that maps each state s to the action to execute if s is reached.

The optimal policy π* is the policy that always leads to maximizing the expected sum of rewards collected in future states (Maximum Expected Utility principle):

π*(S_i) = argmax_a Σ_j P(S_j | S_i, a) U(S_j)
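Once utilities are known, the argmax above is a direct computation. A toy sketch on a hypothetical two-state MDP (the states, actions, and utility values are invented for illustration only):

```python
# Greedy policy extraction: pi*(s) = argmax_a sum_j P(s_j | s, a) U(s_j).
P = {  # P[(s, a)] = list of (probability, successor) pairs
    ('s0', 'safe'):  [(1.0, 's0')],
    ('s0', 'risky'): [(0.5, 'goal'), (0.5, 's0')],
}
U = {'s0': 0.2, 'goal': 1.0}          # assumed state utilities

def greedy_action(s, actions):
    """Action maximizing the expected utility of the successor state."""
    return max(actions, key=lambda a: sum(p * U[s2] for p, s2 in P[(s, a)]))

pi = {'s0': greedy_action('s0', ['safe', 'risky'])}
```

Here 'risky' wins (0.5·1.0 + 0.5·0.2 = 0.6 versus 0.2 for 'safe'), illustrating that the argmax compares expected successor utilities, not single outcomes.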
Optimal policy and state utilities for MDP
• What will happen if the cost of a step is very low?

[Figure: the optimal policy (left) and the state utilities (right) for the 4×3 grid world.]

State utilities:
  y=3:  0.812  0.868  0.912   +1
  y=2:  0.762  (wall) 0.660   −1
  y=1:  0.705  0.655  0.611  0.388
         x=1    x=2    x=3    x=4
Finding the optimal policy
• A solution must satisfy both equations:
  – π*(S_i) = argmax_a Σ_j P(S_j | S_i, a) U(S_j)   (1)
  – U(S_i) = R(S_i) + γ max_a Σ_j P(S_j | S_i, a) U(S_j)   (2)
• Value iteration:
  – start with a guess of the utility function and use (2) to get a better estimate
• Policy iteration:
  – start with a fixed policy and solve for the exact utilities of the states; use (1) to find an updated policy
Value iteration
function Value-Iteration(MDP) returns a utility function
  inputs: P(S_j | S_i, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   U′, utility function, initially identical to R
  repeat
    U ← U′
    for each state S_i do
      U′[S_i] ← R[S_i] + γ max_a Σ_j P(S_j | S_i, a) U[S_j]
    end
  until Close-Enough(U, U′)
  return U
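The pseudocode above can be turned into a short runnable sketch for the 4×3 grid world (assuming γ = 1, step reward −1/25 = −0.04, the 0.8/0.1/0.1 motion model, and a convergence threshold of my choosing); it reproduces utilities such as U(1,1) ≈ 0.705:

```python
# Value iteration on the 4x3 grid world from the slides.
GAMMA, STEP_R = 1.0, -0.04
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERM = {(4, 3): 1.0, (4, 2): -1.0}    # terminal states and their rewards
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def step(s, a):
    """Deterministic landing cell (bounce off the walls and the (2,2) obstacle)."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def expected_u(s, a, U):
    """Sum_j P(S_j | s, a) U[S_j] for the 0.8/0.1/0.1 motion model."""
    return 0.8 * U[step(s, a)] + sum(0.1 * U[step(s, d)] for d in PERP[a])

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        U_new = {s: TERM[s] if s in TERM
                 else STEP_R + GAMMA * max(expected_u(s, a, U) for a in MOVES)
                 for s in STATES}
        if max(abs(U_new[s] - U[s]) for s in STATES) < eps:
            return U_new
        U = U_new

U = value_iteration()
```

With γ = 1 the updates still converge here because every trajectory eventually reaches a terminal state.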
Value iteration - convergence of utility values
Policy iteration

function Policy-Iteration(MDP) returns a policy
  inputs: P(S_j | S_i, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state S_i do
      if max_a Σ_j P(S_j | S_i, a) U[S_j] > Σ_j P(S_j | S_i, π(S_i)) U[S_j] then
        π(S_i) ← argmax_a Σ_j P(S_j | S_i, a) U[S_j]
        unchanged? ← false
  until unchanged?
  return π
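A runnable sketch of the same algorithm for the 4×3 grid world, with Value-Determination approximated by repeated Bellman sweeps under the fixed policy (parameters as before are assumptions: γ = 1, step reward −0.04, 0.8/0.1/0.1 noise):

```python
# Policy iteration on the 4x3 grid world.
GAMMA, STEP_R = 1.0, -0.04
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERM = {(4, 3): 1.0, (4, 2): -1.0}
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def step(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def q(s, a, U):
    """Expected successor utility: sum_j P(S_j | s, a) U[S_j]."""
    return 0.8 * U[step(s, a)] + sum(0.1 * U[step(s, d)] for d in PERP[a])

def evaluate(pi, U, sweeps=100):
    """Simplified Value-Determination: iterate the fixed-policy Bellman update."""
    for _ in range(sweeps):
        U = {s: TERM[s] if s in TERM else STEP_R + GAMMA * q(s, pi[s], U)
             for s in STATES}
    return U

def policy_iteration():
    U = {s: 0.0 for s in STATES}
    pi = {s: 'N' for s in STATES}      # arbitrary initial policy
    while True:
        U = evaluate(pi, U)
        unchanged = True
        for s in STATES:
            if s in TERM:
                continue
            best = max(MOVES, key=lambda a: q(s, a, U))
            if q(s, best, U) > q(s, pi[s], U) + 1e-12:
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U

pi, U = policy_iteration()
```

The result shows the well-known detour: at (3,1) the optimal action is W, moving away from the goal to avoid slipping into the −1 state.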
Decision Theoretic Agent
• An agent that must act in uncertain environments; its actions may be non-deterministic.
Stationarity and Markov assumption
• A stationary process is one whose dynamics, i.e., P(Xt|Xt-1,…,X0) for t>0, are assumed not to change with time.
• The Markov assumption is that the current state Xt is dependent only on a finite history of previous states. In this class, we will only consider first-order Markov processes for which
P(Xt | Xt-1, …, X0) = P(Xt | Xt-1)
Transition model and sensor model
• For first-order Markov processes, the laws describing how the process state evolves with time are contained entirely within the conditional distribution P(Xt|Xt-1), which is called the transition model for the process.
• We will also assume that the observable state variables (evidence) Et are dependent only on the state variables Xt. P(Et|Xt) is called the sensor or observational model.
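To make the two models concrete, here is a minimal filtering step for a hypothetical two-state weather process (the 0.7/0.3 transition and 0.9/0.2 sensor probabilities are invented for illustration):

```python
# Hypothetical transition model P(X_t | X_{t-1}) and sensor model P(E_t | X_t).
# States: index 0 = rain, index 1 = no rain; evidence: umbrella seen or not.
T = [[0.7, 0.3],    # P(X_t | X_{t-1} = rain)
     [0.3, 0.7]]    # P(X_t | X_{t-1} = no rain)
O = {True: [0.9, 0.2], False: [0.1, 0.8]}   # P(evidence | X_t) per state

def filter_step(belief, evidence):
    """Predict with the transition model, weight by the sensor model, renormalize."""
    predicted = [sum(belief[i] * T[i][j] for i in range(2)) for j in range(2)]
    weighted = [O[evidence][j] * predicted[j] for j in range(2)]
    z = sum(weighted)
    return [w / z for w in weighted]

belief = filter_step([0.5, 0.5], True)   # umbrella observed at t = 1
```

Starting from a uniform prior, observing the umbrella shifts the belief toward rain: 0.5·0.9 versus 0.5·0.2 before renormalizing, giving P(rain) = 0.45/0.55 ≈ 0.818.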
Decision Theoretic Agent - implemented
Sensor model
Generic example – part of a steam train control problem
Sensor model II
Model for lane position sensor for automated vehicle
Dynamic belief network – generic structure
• Describes the action model, namely the state evolution model.
State evolution model - example
Dynamic Decision Network
• Can handle uncertainty
• Can deal with continuous streams of evidence
• Can handle unexpected events (they have no fixed plan)
• Can handle sensor noise and sensor failure
• Can act to obtain information
• Can handle large state spaces