Post on 18-Dec-2015
1
Policies for POMDPs
Minqing Hu
2
Background on Solving POMDPs
• MDP policy: a mapping from states to actions
• POMDP policy: a mapping from probability distributions (over states) to actions
– belief state: a probability distribution over states
– belief space: the entire probability space (infinite)
3
Policies in MDPs
• k-horizon value function:
V_t^δ(s_i) = q_i^{δ_t(s_i)} + Σ_j p_ij^{δ_t(s_i)} V_{t-1}^δ(s_j)
• Optimal policy δ* is the one where, for all states s_i and all other policies δ,
V^{δ*}(s_i) ≥ V^δ(s_i)
4
Finite k-horizon POMDP
• POMDP: <S, A, P, Z, R, W>
• transition probability: p_ij^a
• probability of observing z after taking action a and ending in state s_j: r_jz^a
• immediate rewards: w_ijz^a
• immediate reward of performing action a in state s_i:
q_i^a = Σ_{j,z} p_ij^a r_jz^a w_ijz^a
• Objective: to find an optimal policy for the finite k-horizon POMDP
δ* = (δ_1, δ_2, …, δ_k)
5
A two state POMDP
• represent the belief state with a single number p.
• the entire space of belief states can be represented as a line segment.
belief space for a 2 state POMDP
6
belief state updating
• finite number of possible next belief states, given a belief state
– a finite number of actions
– a finite number of observations
• b’ = T(b | a, z): given a and z, b’ is fully determined.
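The belief update T(b | a, z) is just Bayes' rule over the transition and observation probabilities. A minimal NumPy sketch (the argument names `P_a` and `R_az` are assumptions, not notation from the slides):

```python
import numpy as np

def belief_update(b, P_a, R_az):
    """One-step belief update b' = T(b | a, z).

    b    : current belief, shape (|S|,)
    P_a  : transition matrix for action a, P_a[i, j] = p(s_j | s_i, a)
    R_az : observation column for (a, z), R_az[j] = p(z | a, s_j)
    """
    unnorm = (b @ P_a) * R_az      # b'_j proportional to sum_i b_i p_ij^a r_jz^a
    return unnorm / unnorm.sum()   # normalize by p(z | b, a)
```

Because the action and observation sets are finite, applying this map to one belief state yields only finitely many possible successor beliefs.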
7
• the process of maintaining the belief state is Markovian: the next belief state depends only on the current belief state (and the current action and observation)
• we are now back to solving an MDP policy problem, with some adaptations
8
• continuous space: the value function is some arbitrary function
– b: belief space
– V(b): value function
• Problem: how can we easily represent this value function?
Value function over belief space
9
Fortunately, the finite horizon value function is piecewise linear and convex (PWLC) for every horizon length.
Sample PWLC function
10
• A piecewise linear function consists of linear, or hyper-plane, segments
– linear function: α_0 x_0 + α_1 x_1 + … + α_N x_N
– kth linear segment: Σ_{i=0}^{N} α_i^k x_i
– the α-vector: α^k = [α_0^k, α_1^k, …, α_N^k]
– each line or hyper-plane can be represented by its α-vector, α^k(t)
11
• Value function:
V_t*(b) = max_k Σ_i α_i^k(t) b_i
• a convex function
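Evaluating a value function stored as a finite set of α-vectors is a single max over dot products. A sketch (the helper name `pwlc_value` is an assumption):

```python
import numpy as np

def pwlc_value(b, alphas):
    """V_t*(b) = max_k sum_i alpha_i^k(t) b_i over a finite set of alpha-vectors."""
    return max(float(np.dot(alpha, b)) for alpha in alphas)
```

Each α-vector contributes one hyper-plane; taking the pointwise max of hyper-planes is exactly what makes the function piecewise linear and convex.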
12
• 1-horizon POMDP problem
– single action a to execute
– starting belief state b
– ending belief state b’ = T(b | a, z)
– immediate rewards q_i^a
– terminating reward for state s_i: q_i^0
• Expected terminating reward in b’:
V_0(b’) = Σ_i b’_i q_i^0
13
• Value function for t = 1:
V_1*(b) = max_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a q_j^0 ]
• The optimal policy for t = 1:
δ_1*(b) = argmax_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a q_j^0 ]
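The horizon-1 value can be computed directly from the model arrays. A sketch, assuming dict-of-arrays model storage (the names `horizon1_value`, `q`, `P`, `R`, `q0` are illustrative, not from the slides):

```python
import numpy as np

def horizon1_value(b, q, P, R, q0):
    """V_1*(b) = max_a sum_i b_i [ q_i^a + sum_{j,z} p_ij^a r_jz^a q_j^0 ].

    q  : dict a -> immediate reward vector q^a, shape (|S|,)
    P  : dict a -> transition matrix, P[a][i, j] = p_ij^a
    R  : dict a -> observation matrix, R[a][j, z] = r_jz^a
    q0 : terminating reward vector, shape (|S|,)
    """
    best = -np.inf
    for a in q:
        # sum_z r_jz^a = 1, but keep the full double sum for clarity
        future = P[a] @ (R[a].sum(axis=1) * q0)   # sum_{j,z} p_ij^a r_jz^a q_j^0
        best = max(best, float(b @ (q[a] + future)))
    return best
```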
14
General k-horizon value function
• Same strategy as for the 1-horizon case
• Assume that we have the optimal value function at t − 1, V_{t-1}*(·)
• Value function has the same basic form as in the MDP, but with
– the current belief state
– the possible observations
– the transformed belief state
15
• Value function:
V_t*(b) = max_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a V_{t-1}*(T(b | a, z)) ]
• Piecewise linear and convex? That is, can it still be written as
V_t*(b) = max_k Σ_i α_i^k(t) b_i ?
16
Inductive Proof
• Base case:
V_0(b) = Σ_i b_i q_i^0
• Inductive hypothesis:
V_{t-1}*(b) = max_k Σ_i α_i^k(t-1) b_i
– the belief update, written componentwise:
b’_j = [ Σ_i b_i p_ij^a r_jz^a ] / [ Σ_{i,j} b_i p_ij^a r_jz^a ]
– substitute the transformed belief state:
V_{t-1}*(T(b | a, z)) = max_k Σ_j α_j^k(t-1) b’_j
= max_k [ Σ_{i,j} b_i p_ij^a r_jz^a α_j^k(t-1) ] / [ Σ_{i,j} b_i p_ij^a r_jz^a ]
17
Inductive Proof (contd)
• Value function at step t (using the recursive definition):
V_t*(b) = max_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a α_j^{ℓ(b,a,z)}(t-1) ]
where ℓ(b, a, z) = argmax_k Σ_{i,j} b_i p_ij^a r_jz^a α_j^k(t-1)
• New α-vector at step t:
α_i^t = q_i^a + Σ_{j,z} p_ij^a r_jz^a α_j^{ℓ(b,a,z)}(t-1)
• Value function at step t (PWLC):
V_t*(b) = max_k Σ_i α_i^k(t) b_i
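The backup that builds a new α-vector at a given belief b can be sketched in NumPy. The helper name `backup` and the dict-of-arrays model layout are assumptions; the construction of ℓ(b, a, z) follows the slide's definition:

```python
import numpy as np

def backup(b, q, P, R, alphas):
    """One value-iteration backup at belief b: returns the best new alpha-vector,
    alpha_i^t = q_i^a + sum_{j,z} p_ij^a r_jz^a alpha_j^{l(b,a,z)}(t-1).

    q, P, R : per-action reward vector, transition matrix P[a][i, j] = p_ij^a,
              and observation matrix R[a][j, z] = r_jz^a
    alphas  : list of alpha-vectors from step t-1
    """
    best_vec, best_val = None, -np.inf
    for a in q:
        new = np.array(q[a], dtype=float)
        for z in range(R[a].shape[1]):
            # g[k][i] = sum_j p_ij^a r_jz^a alpha_j^k(t-1), one candidate per k
            g = [P[a] @ (R[a][:, z] * alpha) for alpha in alphas]
            # l(b, a, z): the old alpha-vector that is best for this (a, z) at b
            new += g[int(np.argmax([float(b @ gk) for gk in g]))]
        val = float(b @ new)
        if val > best_val:
            best_vec, best_val = new, val
    return best_vec
```

Running this backup at enough belief points (or over all regions of the simplex) yields the full set of α-vectors for step t.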
18
Geometric interpretation of value function
• |S| = 2
Sample value function for |S| = 2
19
• |S| = 3
• Hyper-planes
• Finite number of regions over the simplex
Sample value function for |S| = 3
20
POMDP Value Iteration Example
• a 2-horizon problem
• assume the POMDP has
– two states s1 and s2
– two actions a1 and a2
– three observations z1, z2 and z3
21
Given belief state b = [ 0.25, 0.75 ], terminating reward = 0, and immediate rewards
q_{s1}^{a1} = 1,  q_{s2}^{a1} = 0,  q_{s1}^{a2} = 0,  q_{s2}^{a2} = 1.5
the value of each action is
V(b | a1) = 0.25 × 1 + 0.75 × 0 = 0.25
V(b | a2) = 0.25 × 0 + 0.75 × 1.5 = 1.125
Horizon 1 value function
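The two expected rewards on this slide are plain dot products and can be recomputed in a couple of lines:

```python
# Belief state and immediate reward vectors from the slide
b = [0.25, 0.75]
q_a1, q_a2 = [1.0, 0.0], [0.0, 1.5]

v_a1 = sum(bi * qi for bi, qi in zip(b, q_a1))   # 0.25
v_a2 = sum(bi * qi for bi, qi in zip(b, q_a2))   # 1.125
```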
22
• the blue region: the best strategy is a1
• the green region: a2 is the best strategy
Horizon 1 value function
23
2-Horizon value function
• construct the horizon 2 value function from the horizon 1 value function.
• three steps:
– how to compute the value of a belief state for a given action and observation
– how to compute the value of a belief state given only an action
– how to compute the actual value for a belief state
24
Step1:
• A restricted problem: given a belief state b, what is the value of doing action a1 first and receiving observation z1?
• The value of a belief state for horizon 2 is the value of the immediate action plus the value of the next action.
25
b’ = T(b | a1,z1)
Value of a fixed action and observation (immediate reward function + horizon 1 value function)
26
S(a, z), a function which directly gives the value of each belief state after the action a1 is taken and observation z1 is seen
27
• Value function of horizon 2:
V_2*(b) = max_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a V_1*(T(b | a, z)) ]
(first term: immediate rewards; second term: S(a, z))
• Step 1: done
28
Step 2: how to compute the value of a belief state given only the action
Transformed value function
29
So what is the horizon 2 value of a belief state, given a particular action a1?
It depends on:
• the value of doing action a1
• what action we do next
– which depends on the observation made after action a1
30
plus the immediate reward of doing action a1 in b:
z1: V_1(b^{a1,z1}) = 0.8,  r_{z1}^{a1} = 0.6
z2: V_1(b^{a1,z2}) = 0.7,  r_{z2}^{a1} = 0.25
z3: V_1(b^{a1,z3}) = 1.2,  r_{z3}^{a1} = 0.15
V_2^{a1}(b) = 0.6 × 0.8 + 0.25 × 0.7 + 0.15 × 1.2 = 0.835
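The weighted sum on this slide (0.6×0.8 + 0.25×0.7 + 0.15×1.2) can be checked directly:

```python
# Observation probabilities under (b, a1) and horizon-1 continuation values,
# taken from the slide
p_z    = [0.6, 0.25, 0.15]
v_next = [0.8, 0.7, 1.2]

v_a1 = sum(p * v for p, v in zip(p_z, v_next))   # 0.835
```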
31
Transformed value function for all observations
Belief point in transformed value function partitions
32
Partition for action a1
33
• The figure allows us to easily see what the best strategies are after doing action a1.
• the value of the belief point b at horizon 2
= the immediate reward of doing action a1 + the value of the functions S(a1,z1), S(a1,z2), S(a1,z3) at belief point b.
34
• Each line segment is constructed by adding the immediate reward line segment to the line segments for each future strategy.
Horizon 2 value function and partition for action a1
35
Repeat the process for action a2
Value function and partition for action a2
36
Step 3: best horizon 2 policy
Combined a1 and a2 value functions Value function for horizon 2
37
• Repeat the process for the value functions of the 3-horizon, …, k-horizon POMDP:
V_t*(b) = max_{a∈A} Σ_i b_i [ q_i^a + Σ_{j,z} p_ij^a r_jz^a V_{t-1}*(T(b | a, z)) ]
38
Alternate Value function interpretation
• A decision tree
– nodes represent action decisions
– branches represent observations made
• Too many trees to be generated!