Partially Observable Markov Decision Process (Chapters 15 & 16)
José Luis Peralta
TKK | Automation Technology Laboratory

Page 1: Title

Partially Observable Markov Decision Process (Chapters 15 & 16)

José Luis Peralta
TKK | Automation Technology Laboratory

Page 2: Contents

• POMDP
• Example POMDP
• Finite World POMDP algorithm
• Practical Considerations
• Approximate POMDP Techniques

Page 3: Partially Observable Markov Decision Processes (POMDP)

• POMDP: uncertainty in measurements, in the state, and in the effects of controls.
• Idea: adapt the previous Value Iteration Algorithm (VIA).

Page 4: Partially Observable Markov Decision Processes (POMDP)

• POMDP: the world cannot be sensed directly.
• Measurements are incomplete, noisy, etc.
• Partial observability: the robot has to estimate a posterior distribution over the possible world states.

Page 5: Partially Observable Markov Decision Processes (POMDP)

• POMDP: an algorithm exists to find the optimal control policy for a FINITE world, i.e. the state space, the action space, the space of observations, and the planning horizon are all finite.
• The computation is complex; for the continuous case there are approximations.

Page 6: Partially Observable Markov Decision Processes (POMDP)

• The algorithms we are going to study are all based on Value Iteration (VI), the same as before, except that the state x is not observable.
• The robot has to make decisions in the BELIEF STATE b: the robot's internal knowledge about the state of the environment, i.e. the space of posterior distributions over states.

Page 7: Partially Observable Markov Decision Processes (POMDP)

• So value iteration is carried out over beliefs b rather than states x.
• The control policy likewise maps beliefs to controls.

Page 8: Partially Observable Markov Decision Processes (POMDP)

• Belief b: each value in a POMDP is a function of an entire probability distribution.
• Problems: a finite state space gives a continuous belief space; a continuous state space gives an infinitely-dimensional belief-space continuum.
• There is also complexity in calculating the value function, because of the integral over all distributions.

Page 9: Partially Observable Markov Decision Processes (POMDP)

• In the end, an optimal solution exists for an interesting special case, the FINITE world: state space, action space, space of observations, and planning horizon all finite.
• The solutions of the value function are piecewise linear functions over the belief space. This arises because:
  • expectation is a linear operation;
  • we have the ability to select different controls in different parts of the belief space.

Page 10: Example POMDP

2 states: x1, x2. 3 control actions: u1, u2, u3.

Page 11: Example POMDP

Payoff when executing u1 or u2:

r(x1, u1) = -100    r(x1, u2) = +100
r(x2, u1) = +100    r(x2, u2) = -50

Dilemma: the payoffs are opposite in each state, so knowledge of the state translates directly into payoff.

Page 12: Example POMDP

To acquire knowledge, the robot has the control u3 (cost of waiting, cost of sensing, etc.):

r(x1, u3) = r(x2, u3) = -1

u3 affects the state of the world in a non-deterministic manner:

p(x1' | x1, u3) = 0.2    p(x1' | x2, u3) = 0.8
p(x2' | x1, u3) = 0.8    p(x2' | x2, u3) = 0.2

Page 13: Example POMDP

• Benefit: before each control decision, the robot can sense. By sensing, the robot gains knowledge about the state, makes better control decisions, and raises its payoff expectation.
• With control action u3, the robot senses without taking a terminal action.

Page 14: Example POMDP

• The measurement model is governed by the following probability distribution:

p(z1 | x1) = 0.7    p(z2 | x1) = 0.3
p(z1 | x2) = 0.3    p(z2 | x2) = 0.7

Page 15: Example POMDP

This example is easy to graph over the belief space (2 states).
• Belief state: p1 = b(x1), p2 = b(x2), but p2 = 1 - p1, so we just graph p1.

Page 16: Example POMDP

• Control policy: a function that maps the unit interval [0; 1] to the space of all actions, pi: [0; 1] -> u.

Page 17: Example POMDP - Control Choice

• Control choice: when to execute what control?
• Payoff in POMDPs: first consider the immediate payoff of u1, u2, u3. The payoff is now a function of the belief state, so for b = (p1, p2) the expected payoff is

r(b, u) = p1 r(x1, u) + p2 r(x2, u)
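As a quick check of the numbers above, here is a minimal Python sketch (not from the slides; the function name is ours) that evaluates r(b, u) for the example's payoff table:

```python
# Sketch: expected immediate payoff r(b, u) for the two-state example,
# with belief b = (p1, 1 - p1).
R = {  # r(x, u) from the example slides
    ("x1", "u1"): -100.0, ("x2", "u1"): 100.0,
    ("x1", "u2"):  100.0, ("x2", "u2"): -50.0,
    ("x1", "u3"):   -1.0, ("x2", "u3"):  -1.0,
}

def expected_payoff(p1: float, u: str) -> float:
    """r(b, u) = p1 * r(x1, u) + (1 - p1) * r(x2, u)."""
    return p1 * R[("x1", u)] + (1.0 - p1) * R[("x2", u)]

for u in ("u1", "u2", "u3"):
    print(u, expected_payoff(0.4, u))  # e.g. at p1 = 0.4: 20, 10, -1
```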

Page 18: Example POMDP - Control Choice

Page 19: Example POMDP - Control Choice

Page 20: Example POMDP - Control Choice

Page 21: Example POMDP - Control Choice

• First we calculate V1(b) = max_u r(b, u): at horizon T = 1, the robot simply selects the action of highest expected payoff.
• V1 is a piecewise linear, convex function: the maximum of the individual payoff functions.

Page 22: Example POMDP - Control Choice

Page 23: Example POMDP - Control Choice

• Again V1(b) = max_u r(b, u): the robot selects the action of highest expected payoff.
• The transition in the optimal policy occurs where r(b, u1) = r(b, u2), i.e. at p1 = 3/7.
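A small sketch (assuming numpy; not part of the original slides) that builds this upper envelope and confirms the crossover at p1 = 3/7:

```python
import numpy as np

# The horizon-1 value function V1(p1) = max_u r(b, u) is the upper
# envelope of the three linear payoff functions from the example.
p = np.linspace(0.0, 1.0, 1001)
r_u1 = -100 * p + 100 * (1 - p)   # r(b, u1) = 100 - 200 p1
r_u2 =  100 * p -  50 * (1 - p)   # r(b, u2) = 150 p1 - 50
r_u3 = np.full_like(p, -1.0)      # r(b, u3) = -1
V1 = np.maximum.reduce([r_u1, r_u2, r_u3])

# u1/u2 crossover: 100 - 200 p1 = 150 p1 - 50  =>  p1 = 3/7
print(3 / 7, V1[p.searchsorted(3 / 7)])  # value there is 100/7, about 14.3
```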

Page 24: Example POMDP - Sensing

• Now we add perception: what if the robot can sense before it chooses a control? How does this affect the optimal value function?
• Sensing information about the state enables the robot to choose a better control action.
• In the previous example, at p1 = 3/7 the expected payoff is 100/7, about 14.3. How much better will this be after sensing?

Page 25: Example POMDP - Control Choice

The belief after sensing z1, as a function of the belief before sensing, is given by Bayes rule:

p1' = p(z1 | x1) p1 / p(z1)

Finally, for p1 = 0.4:

p1' = 0.7 * 0.4 / (0.7 * 0.4 + 0.3 * 0.6) = 0.28 / 0.46 = 0.6087
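A minimal sketch of this Bayes update in Python (the function name update_belief is ours, not the book's):

```python
# Belief update after sensing, for the example's measurement model
# p(z1 | x1) = 0.7, p(z1 | x2) = 0.3.
def update_belief(p1: float, p_z_given_x1: float, p_z_given_x2: float) -> float:
    """Posterior p1' = p(z|x1) p1 / (p(z|x1) p1 + p(z|x2) (1 - p1))."""
    pz = p_z_given_x1 * p1 + p_z_given_x2 * (1.0 - p1)  # normalizer p(z)
    return p_z_given_x1 * p1 / pz

print(update_belief(0.4, 0.7, 0.3))  # 0.28 / 0.46 = 0.6087
```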

Page 26: Example POMDP - Control Choice

How does this affect the value function?

Page 27: Example POMDP - Control Choice

Mathematically, that is just replacing p1 with p1' in the value function V1.

Page 28: Example POMDP - Control Choice

However, our interest is the complete expected value function after sensing, which also considers the probability of sensing the other measurement z2. This is the expectation of V1 over both measurements, weighted by p(z | b).
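A sketch of this expectation for the example's two measurements, reusing the horizon-1 value function; the names and structure are illustrative, not from [1]:

```python
def V1(p1: float) -> float:
    """Horizon-1 value: max of the three linear payoffs."""
    return max(100 - 200 * p1, 150 * p1 - 50, -1.0)

def expected_V1_after_sensing(p1: float) -> float:
    """sum over z of p(z | b) * V1(posterior belief after z)."""
    total = 0.0
    for pz_x1, pz_x2 in ((0.7, 0.3), (0.3, 0.7)):  # models for z1, z2
        pz = pz_x1 * p1 + pz_x2 * (1.0 - p1)       # p(z | b)
        total += pz * V1(pz_x1 * p1 / pz)          # Bayes posterior p1'
    return total

print(expected_V1_after_sensing(0.4))  # about 49.0 > 20 without sensing
```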

Page 29: Example POMDP - Control Choice

And this results in:

Page 30: Example POMDP - Control Choice

Mathematically:

Page 31: Example POMDP - Prediction

To plan at a horizon larger than T = 1, we have to take the state transition into consideration and project our value function accordingly.

According to our transition probability model for u3:

p1' = 0.2 p1 + 0.8 (1 - p1) = 0.8 - 0.6 p1

If p1 = 0, then p1' = 0.8; if p1 = 1, then p1' = 0.2. In between, the expectation is linear.
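A one-line sketch of this prediction step (illustrative naming, using the transition model from page 12):

```python
# Projecting the belief through action u3 with the example's model
# p(x1' | x1, u3) = 0.2 and p(x1' | x2, u3) = 0.8.
def predict_belief_u3(p1: float) -> float:
    """p1' = 0.2 * p1 + 0.8 * (1 - p1) = 0.8 - 0.6 * p1."""
    return 0.2 * p1 + 0.8 * (1.0 - p1)

print(predict_belief_u3(0.0), predict_belief_u3(1.0))  # 0.8, 0.2
```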

Page 32: Example POMDP - Prediction

And this results in:

Page 33: Example POMDP - Prediction

And adding u1 and u2, we have:

Page 34: Example POMDP - Prediction

Mathematically (note the fixed cost of u3, -1, in the backup):

Page 35: Example POMDP - Pruning

Full backups quickly become impractical:
• At T = 20, the value function is defined over 10^547,864 linear functions.
• At T = 30, it is defined over 10^561,012,337 linear functions.

Impractical!!! Efficient approximate POMDP methods are needed.
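To see why, here is a rough Python sketch of one common counting, where each unpruned backup pairs a control with one horizon-(T-1) function per measurement, so |V_T| = |U| * |V_(T-1)|^|Z|; the slide's exact exponents may come from a slightly different counting:

```python
from math import log10

# Track log10 of the number of linear functions to avoid overflow.
U, Z = 3, 2              # controls and measurements in the toy example
log_n = log10(3.0)       # |V_1| = 3 linear payoff functions
for T in range(2, 21):   # unpruned backups up to horizon 20
    log_n = log10(U) + Z * log_n
print(f"|V_20| ~ 10^{log_n:.0f} linear functions")  # astronomically many
```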

Page 36: Finite World POMDP algorithm

To understand this, read the mathematical derivation of POMDPs, pp. 531-536 in [1].

Page 37: Finite World POMDP algorithm

To understand this, read the mathematical derivation of POMDPs, pp. 531-536 in [1].

Page 38: Example POMDP - Practical Considerations

It looks easy, so let's try something more "real": the probabilistic robot "RoboProb".

Page 39: Example POMDP - Practical Considerations

It looks easy, so let's try something more "real": the probabilistic robot "RoboProb".

11 states: x1, x2, ..., x11, arranged in a grid (top row x1-x4, middle row x5-x7, bottom row x8-x11).
5 control actions: u1, u2, u3, u4, u5, where u5 = sense without moving.

Transition model: a motion command succeeds with probability 0.8 and slips sideways to an adjacent cell with probability 0.1 to each side.

Page 40: Example POMDP - Practical Considerations

It looks easy, so let's try something more "real": the probabilistic robot "RoboProb".

"Reward" payoff over the grid:

-0.04  -0.04  -0.04    +1
-0.04  (wall) -0.04    -1
-0.04  -0.04  -0.04  -0.04

The same set holds for all control actions. Examples:

r(x1, u1) = -0.04    r(x8, u2) = -0.04
r(x7, u5) = -1       r(x7, u3) = -1

Page 41: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

Transition probability p(xi' | xj, uk). Example: p(xi' | xj, u1), rows = current state, columns = posterior state:

      1    2    3    4    5    6    7    8    9    10   11
 1   0.9  0.1  0    0    0    0    0    0    0    0    0
 2   0.1  0.8  0.1  0    0    0    0    0    0    0    0
 3   0    0.1  0.8  0.1  0    0    0    0    0    0    0
 4   0    0    0    1    0    0    0    0    0    0    0
 5   0.8  0    0    0    0.2  0    0    0    0    0    0
 6   0    0    0.8  0    0    0.1  0.1  0    0    0    0
 7   0    0    0    0    0    0    1    0    0    0    0
 8   0    0    0    0    0.8  0    0    0.1  0.1  0    0
 9   0    0    0    0    0    0    0    0.1  0.8  0.1  0
10   0    0    0    0    0    0.8  0    0    0.1  0    0.1
11   0    0    0    0    0    0    0.8  0    0    0.1  0.1
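As a sketch of how this matrix is used (assuming numpy; not from the slides), one belief prediction step is b_bar = T^T b:

```python
import numpy as np

# T_u1[i, j] = p(x_j' | x_i, u1), the u1 transition matrix from the slide.
T_u1 = np.array([
    [0.9, 0.1, 0,   0,   0,   0,   0,   0,   0,   0,   0  ],
    [0.1, 0.8, 0.1, 0,   0,   0,   0,   0,   0,   0,   0  ],
    [0,   0.1, 0.8, 0.1, 0,   0,   0,   0,   0,   0,   0  ],
    [0,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0  ],
    [0.8, 0,   0,   0,   0.2, 0,   0,   0,   0,   0,   0  ],
    [0,   0,   0.8, 0,   0,   0.1, 0.1, 0,   0,   0,   0  ],
    [0,   0,   0,   0,   0,   0,   1,   0,   0,   0,   0  ],
    [0,   0,   0,   0,   0.8, 0,   0,   0.1, 0.1, 0,   0  ],
    [0,   0,   0,   0,   0,   0,   0,   0.1, 0.8, 0.1, 0  ],
    [0,   0,   0,   0,   0,   0.8, 0,   0,   0.1, 0,   0.1],
    [0,   0,   0,   0,   0,   0,   0.8, 0,   0,   0.1, 0.1],
])
b = np.full(11, 1 / 11)   # uniform prior belief
b_bar = T_u1.T @ b        # predicted belief after executing u1
print(b_bar.sum())        # rows sum to 1, so b_bar is a distribution
```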

Page 42: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

Transition probability p(xi' | xj, u5): since u5 senses without moving, the matrix is the 11 x 11 identity; the state stays unchanged.

Page 43: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

Measurement probability p(zj | xi), rows = current state, columns = probability of measuring zj: p(zi | xi) = 0.7 on the diagonal, and 0.03 for each of the ten other measurements (0.7 + 10 * 0.03 = 1).
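A companion sketch for the measurement step (again assuming numpy, with illustrative names): the Bayes correction b'(x) proportional to p(z | x) * b_bar(x):

```python
import numpy as np

# Measurement model from the slide: 0.7 on the diagonal, 0.03 elsewhere.
N = 11
P_z_given_x = np.full((N, N), 0.03) + np.diag(np.full(N, 0.67))

def correct(b_bar: np.ndarray, z: int) -> np.ndarray:
    """b'(x) proportional to p(z | x) * b_bar(x), then normalized."""
    unnorm = P_z_given_x[:, z] * b_bar
    return unnorm / unnorm.sum()

b = correct(np.full(N, 1 / N), z=4)  # observe z5 from a uniform belief
print(b[4])                          # belief mass concentrates on x5
```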

Page 44: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

Belief states: p1 = b(x1), p2 = b(x2), p3 = b(x3), ..., p11 = b(x11) = 1 - p1 - p2 - ... - p10.

Impossible to graph!!

Page 45: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

Each linear function results from executing a control u, followed by observing a measurement z, and then executing a control u'.

Page 46: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

• Defining the measurement probability
• Defining the "reward" payoff
• Defining the transition probability
• Merging the transition (control) probability

Page 47: Example POMDP - Practical Considerations

It's getting kind of hard :S... Probabilistic robot "RoboProb".

With N the number of states and NC the number of controls:
• Setting beliefs
• Executing u: NC * N times
• Sensing z: N times
• Executing u': NC * N times

Page 48: Example POMDP - Practical Considerations

Now what...? Probabilistic robot "RoboProb".

Calculating r(b, u) is a sum of N terms, with N the number of states and NC the number of controls.

The real problem is to compute p(b' | u, b).

Page 49: Example POMDP - Practical Considerations

The real problem is to compute p(b' | u, b).

• Given a belief b and a control action u, the outcome is a distribution over distributions; the key factor in this update is the conditional probability p(b' | u, b).
• Because the posterior belief also depends on the next measurement, and the measurement itself is generated stochastically, this probability specifies a distribution over probability distributions.

Page 50: Example POMDP - Practical Considerations

The real problem is to compute p(b' | u, b). So we write it as an integral of p(b' | u, b, z) p(z | u, b) over z, where p(b' | u, b, z) contains only one non-zero term: the belief that results from b after executing u and observing z.

Page 51: Example POMDP - Practical Considerations

The real problem is to compute p(b' | u, b). Arriving at:

p(b' | u, b) = integral of p(b' | u, b, z) p(z | u, b) dz

i.e. we just integrate over measurements z instead of over b'. Because our space is finite, the integral becomes a sum, with

p(z | u, b) = sum over x' of p(z | x') sum over x of p(x' | u, x) b(x)

Page 52: Example POMDP - Practical Considerations

The real problem is to compute p(b' | u, b). In the end we obtain a computable expression, but this VIA is far from practical: for any reasonable number of distinct states, measurements, and controls, the complexity of the value function is prohibitive, even for relatively benign planning horizons.

Approximations are needed.

Page 53: Approximate POMDP Techniques

• Here we have 3 approximate probabilistic planning and control algorithms: QMDP, AMDP, and MC-POMDP.
• They have varying degrees of practical applicability.
• All 3 algorithms rely on approximations of the POMDP value function.
• They differ in the nature of their approximations.

Page 54: Approximate POMDP Techniques - QMDP

• The QMDP framework considers uncertainty only for a single action choice:
  • It assumes that after the immediate next control action, the state of the world suddenly becomes observable.
  • Full observability makes it possible to use the MDP-optimal value function.
  • QMDP generalizes the MDP value function to belief spaces through the mathematical expectation operator (see the sketch below).
• Planning in QMDPs is as efficient as in MDPs, but the value function generally overestimates the true value of a belief state.
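A minimal sketch of the QMDP idea in Python (our own illustration, not the book's algorithm listing): solve the underlying MDP for Q(x, u), then score each action by its expectation under the belief:

```python
import numpy as np

def mdp_q_iteration(T, R, gamma=0.95, iters=200):
    """T: (U, N, N) transition tensor T[u, x, x']; R: (N, U) payoff.
    Plain MDP value iteration over Q-values."""
    N, U = R.shape
    Q = np.zeros((N, U))
    for _ in range(iters):
        V = Q.max(axis=1)                              # V(x') = max_u Q(x', u)
        Q = R + gamma * np.einsum("unm,m->nu", T, V)   # backup
    return Q

def qmdp_policy(b, Q):
    """QMDP action choice: argmax_u sum_x b(x) Q(x, u)."""
    return int(np.argmax(b @ Q))
```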

Page 55: Approximate POMDP Techniques - QMDP

• Algorithm
• The QMDP framework considers uncertainty only for a single action choice.

Page 56: Approximate POMDP Techniques - AMDP

• The Augmented MDP (AMDP) maps the belief into a lower-dimensional representation, over which it then performs exact value iteration.
• The "classical" representation consists of the most likely state under a belief, along with the belief entropy (see the sketch below).
• AMDPs are like MDPs with one added dimension in the state representation that measures the global degree of uncertainty.
• To implement an AMDP, it is necessary to learn the state transition and the reward function in the low-dimensional belief space.
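A tiny sketch of this belief statistic (illustrative, assuming numpy):

```python
import numpy as np

def amdp_state(b: np.ndarray):
    """Low-dimensional AMDP statistic of a belief: the most likely state
    plus the belief entropy (the global degree of uncertainty)."""
    b = b / b.sum()
    entropy = -np.sum(b * np.log(b + 1e-12))
    return int(np.argmax(b)), float(entropy)

print(amdp_state(np.array([0.7, 0.1, 0.1, 0.1])))  # (0, low-ish entropy)
```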

Page 57: Approximate POMDP Techniques - AMDP

• The "classical" representation consists of the most likely state under a belief, along with the belief entropy.

Page 58: Approximate POMDP Techniques - AMDP

[Figure: true vs. estimated mean and covariance]

Page 59: Approximate POMDP Techniques - AMDP

• The application of AMDPs to mobile robot navigation is called coastal navigation. It:
  • anticipates uncertainty;
  • selects motions that trade off overall path length against the uncertainty accrued along a path.
• The resulting trajectories differ significantly from any non-probabilistic solution.
• Being temporarily lost is acceptable if the robot can later re-localize with sufficiently high probability.

Page 60: Approximate POMDP Techniques - AMDP

• AMDP Algorithm

Page 61: Approximate POMDP Techniques - AMDP

Page 62: Approximate POMDP Techniques - MC-POMDP

• The Monte Carlo POMDP (MC-POMDP) is the particle-filter version of POMDPs.
• It calculates a value function defined over sets of particles.
• MC-POMDPs use a local learning technique: a locally weighted learning rule combined with a proximity test based on KL-divergence.
• MC-POMDPs then apply Monte Carlo sampling to implement an approximate value backup.
• The resulting algorithm is a full-fledged POMDP algorithm whose computational complexity and accuracy are both functions of the parameters of the learning algorithm.

Page 63: Approximate POMDP Techniques - MC-POMDP

• Particle set representing belief b
• Value function

Page 64: Approximate POMDP Techniques - MC-POMDP

• MC-POMDP Algorithm

Page 65: Approximate POMDP Techniques - MC-POMDP

The particle filter underlying MC-POMDPs:
1. Start from a discrete Monte Carlo representation of p(x_{k-1} | y_{1:k-1}): a set of N particles x_{k-1}^(i).
2. Draw new particles from the proposal distribution p(x_k^(i) | x_{k-1}^(i)).
3. Given the new observation y_k, evaluate importance weights using the likelihood function: w_k^(i) = p(y_k | x_k^(i)).
4. Resample the particles.
5. The result is a discrete Monte Carlo representation (approximation) of p(x_k | y_{1:k}).
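A compact sketch of one such update in Python; the Gaussian motion and measurement models here are illustrative stand-ins, not from the book:

```python
import numpy as np

def particle_filter_step(particles, y_k, motion_std=0.1, meas_std=0.5):
    """One update of a 1-D particle filter, following the steps above."""
    N = len(particles)
    # 1. Draw new particles from the proposal p(x_k | x_{k-1}).
    particles = particles + np.random.normal(0.0, motion_std, size=N)
    # 2. Evaluate importance weights with the likelihood p(y_k | x_k).
    w = np.exp(-0.5 * ((y_k - particles) / meas_std) ** 2)
    w /= w.sum()
    # 3. Resample particles in proportion to their weights.
    idx = np.random.choice(N, size=N, p=w)
    return particles[idx]  # approximates p(x_k | y_{1:k})
```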

Page 66: References and Links

• References
[1] S. Thrun, W. Burgard, D. Fox. Probabilistic Robotics. MIT Press, 2005.

• Links
http://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process
http://www.cs.cmu.edu/~trey/zmdp/
http://www.cassandra.org/pomdp/index.shtml
http://www.cs.duke.edu/~mlittman/topics/pomdp-page.html

Page 67: Exercise

Exercise 1 in [1], Chapter 15:

A person faces two doors. Behind one is a tiger, behind the other a reward of +10. The person can either listen or open one of the doors. When opening the door with the tiger, the person will be eaten, which has an associated cost of -20. Listening costs -1. When listening, the person will hear a roaring noise that indicates the presence of the tiger, but only with 0.85 probability will the person be able to localize the noise correctly. With 0.15 probability, the noise will appear as if it came from the door hiding the reward.

Your questions:

(a) Provide the formal model of the POMDP, in which you define the state, action, and measurement spaces, the cost function, and the associated probability functions.

(b) What is the expected cumulative payoff/cost of the open-loop action sequence "Listen, listen, open door 1"? Explain your calculation.

(c) What is the expected cumulative payoff/cost of the open-loop action sequence: "Listen, then open the door for which we did not hear a noise"? Again, explain your calculation.