Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)


DESCRIPTION

Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Speaker: 虞台文 (Tai-Wen Yu), Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University. Content: Introduction; Value Iteration for MDP; Belief States & Infinite-State MDP; Value Function of POMDP; The PWLC Property of the Value Function.

TRANSCRIPT

Page 1: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

Speaker: 虞台文 (Tai-Wen Yu)

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 2: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Content

– Introduction
– Value Iteration for MDP
– Belief States & Infinite-State MDP
– Value Function of POMDP
– The PWLC Property of the Value Function

Page 3: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

Introduction

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 4: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Definition: MDP

A Markov decision process is a tuple $\langle S, A, T, R \rangle$, where

– $S$: a finite set of states of the world
– $A$: a finite set of actions
– $T: S \times A \to \Pi(S)$: the state-transition function
– $R: S \times A \to \mathbb{R}$: the reward function

$$T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$
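As a concrete illustration of this tuple (not part of the original slides), the transition and reward functions of a small finite MDP can be stored as NumPy arrays; the particular numbers below are invented.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; the numbers are made up for demonstration.
# T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a); each T[s, a, :] sums to 1.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
# R[s, a] = immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
n_states, n_actions = R.shape
assert np.allclose(T.sum(axis=2), 1.0)  # every T[s, a, :] is a distribution over s'
```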

Page 5: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Complete Observability

Solution procedures for MDPs give values or policies for each state.

Use of these solutions requires that the agent is able to detect the state it is currently in with complete reliability.

Therefore, such a process is called a CO-MDP (completely observable MDP).

Page 6: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Partial Observability

Instead of directly measuring the current state, the agent makes an observation to get a hint about what state it is in.

How to get the hint (i.e., guess the state)?
– Take an action and receive an observation.
– The observation can be probabilistic, i.e., it provides only a hint.
– The "state" will be defined in a probabilistic sense.

Page 7: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Observation Model

$\Omega$: a finite set of observations the agent can experience of its world.

$$O: S \times A \to \Pi(\Omega)$$

$$O(s', a, o) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$$

The probability of getting observation $o$ given that the agent took action $a$ and landed in state $s'$.

Page 8: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Definition: POMDP

A POMDP is a tuple $\langle S, A, T, R, \Omega, O \rangle$, where

– $\langle S, A, T, R \rangle$ describes an MDP;
– $\Omega$ is the finite set of observations;
– $O: S \times A \to \Pi(\Omega)$ is the observation function.

How can we find an optimal policy in such an environment?
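Continuing the illustrative arrays above (again an assumption, not material from the slides), a POMDP only adds an observation function; one convenient layout indexes it by the landed-in state s', the action a, and the observation o.

```python
import numpy as np

# O[s_next, a, o] = P(o_{t+1} = o | s_{t+1} = s_next, a_t = a); invented numbers.
O = np.array([[[0.70, 0.20, 0.10],   # observation probabilities in s'=0, for a=0 and a=1
               [0.50, 0.25, 0.25]],
              [[0.10, 0.30, 0.60],   # observation probabilities in s'=1
               [0.20, 0.20, 0.60]]])
n_obs = O.shape[2]
assert np.allclose(O.sum(axis=2), 1.0)  # every O[s_next, a, :] is a distribution over o
```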

Page 9: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

Value Iteration for MDP

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 10: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Acting Optimally

Finite-Horizon Model: maximize the expected total reward of the next $k$ steps,

$$\max \; E\!\left[ \sum_{t=0}^{k} r_t \right]$$

Infinite-Horizon Discounted Model: maximize the expected discounted total reward,

$$\max \; E\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right], \qquad 0 \le \gamma < 1$$

Is there any difference in the nature of their optimal policies?

Page 11: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Stationary vs. Non-Stationary Policies

Finite-Horizon Model: the optimal policy depends on the number of time steps remaining. Use a non-stationary policy $\pi_t: S \to A$.

Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining. Use a stationary policy $\pi: S \to A$.

Page 12: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Stationary vs. Non-Stationary Policies (cont.)

(Same comparison as above; the subscript $t$ in $\pi_t$ denotes the number of time steps remaining.)

Page 13: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Functions

Finite-Horizon Model (non-stationary policy):

$$V_{\pi,t}(s) = R(s, \pi_t(s)) + \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s')$$

Infinite-Horizon Discounted Model (stationary policy):

$$V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_{\pi}(s')$$

Page 14: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Optimal Policies

Finite-Horizon Model (non-stationary policy):

$$\pi_t^*(s) = \arg\max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right], \qquad \pi_1^*(s) = \arg\max_a R(s, a)$$

Infinite-Horizon Discounted Model (stationary policy):

$$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$$

Page 15: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Optimal Policies (cont.)

The corresponding optimal value functions are

$$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$$

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$$

with the optimal policies $\pi_t^*$ and $\pi^*$ as on the previous slide.

Page 16: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Optimal Policies: Finite-Horizon Model (non-stationary policy)

$$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$$

$$\pi_t^*(s) = \arg\max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right], \qquad \pi_1^*(s) = \arg\max_a R(s, a)$$

How about $\pi_t$ as $t \to \infty$?
What if $V_t(s) \approx V_{t-1}(s)$ for all $s$?
How about $\pi_t$ if $V_t(s) \approx V_{t-1}(s)$ for all $s$?
To find an optimal policy, do we need to pay infinite time?

Page 17: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Iteration

The MDP has a finite number of states, so the backups above can be computed exactly, one horizon at a time.
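A minimal sketch of finite-horizon value iteration for the fully observable case, assuming the T and R arrays introduced earlier; the function name and the convention V[0] = 0 are choices made here, not taken from the slides.

```python
import numpy as np

def mdp_value_iteration(T, R, horizon):
    """Finite-horizon value iteration for a finite MDP.

    T[s, a, s'] is the transition probability and R[s, a] the reward.
    Returns V[t, s] and the non-stationary greedy policy pi[t, s],
    indexed by t = number of steps to go.
    """
    n_states, n_actions = R.shape
    V = np.zeros((horizon + 1, n_states))            # V[0, s] = 0 by convention
    pi = np.zeros((horizon + 1, n_states), dtype=int)
    for t in range(1, horizon + 1):
        Q = R + T @ V[t - 1]     # Q[s, a] = R(s,a) + sum_s' T(s,a,s') V_{t-1}(s')
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

# V, pi = mdp_value_iteration(T, R, horizon=10)
```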

Page 18: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

Belief States & Infinite-State MDP

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 19: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

POMDP Framework

[Figure: the agent interacts with the world (an MDP). Inside the agent, the state estimator SE turns each action/observation pair into a belief state b, which the policy uses to choose the next action.]

SE: state estimator
b: belief state

Page 20: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Belief States

$$\mathbf{b} = \big( b(s_1), b(s_2), \ldots \big)^T, \qquad b(s_i) \ge 0 \;\; \forall s_i \in S, \qquad \sum_{s \in S} b(s) = 1$$

There are uncountably infinitely many belief states.

Page 21: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Space

The belief states form the probability simplex over $S$.

[Figure: for a 2-state POMDP the belief space is the segment $0 \le b(s_1) \le 1$; for a 3-state POMDP it is a triangle (the 2-simplex).]

There are uncountably infinitely many belief states.

Page 22: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Estimation

Given $\mathbf{b}_t$, $a_t$, and $o_{t+1}$, what is $\mathbf{b}_{t+1}$?

Page 23: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Estimation

$$b_{t+1}(s') = P(s' \mid o, a, \mathbf{b}_t)$$

$$= \frac{P(o \mid s', a, \mathbf{b}_t)\, P(s' \mid a, \mathbf{b}_t)}{P(o \mid a, \mathbf{b}_t)}$$

$$= \frac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, P(s \mid a, \mathbf{b}_t)}{P(o \mid a, \mathbf{b}_t)}$$

$$= \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$$

$P(o \mid a, \mathbf{b}_t)$ is a normalization factor.

Page 24: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Estimation (cont.)

Written compactly, the state estimator is

$$\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o) = \frac{\tilde{T}^{a,o}\, \mathbf{b}_t}{P(o \mid a, \mathbf{b}_t)}, \qquad \big[\tilde{T}^{a,o}\big]_{s's} = O(s', a, o)\, T(s, a, s')$$

Remember these.

Page 25: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Estimation (cont.)

The normalization factor is

$$P(o \mid a, \mathbf{b}_t) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)$$

It is linear w.r.t. $\mathbf{b}_t$.

Page 26: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Transition Function

With $\mathbf{b}' = SE(\mathbf{b}, a, o)$, the transition function over belief states is

$$\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}' \mid \mathbf{b}, a) = \sum_o P(\mathbf{b}' \mid \mathbf{b}, a, o)\, P(o \mid \mathbf{b}, a) = \sum_{o:\, SE(\mathbf{b}, a, o) = \mathbf{b}'} P(o \mid \mathbf{b}, a)$$

Page 27: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

State Transition Function (cont.)

Suppose that $SE(\mathbf{b}, a, o_i) \ne SE(\mathbf{b}, a, o_j)$ for all $i \ne j$. Then

$$\tau(\mathbf{b}, a, \mathbf{b}') =
\begin{cases}
P(o \mid \mathbf{b}, a) & \text{if } \mathbf{b}' = SE(\mathbf{b}, a, o) \\
0 & \text{otherwise}
\end{cases}$$
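A sketch of $\tau(\mathbf{b}, a, \cdot)$ under the same assumptions: it enumerates observations, reuses the belief update above, and returns one (b', probability) pair per observation with nonzero probability. Merging observations that happen to produce the same b', as the slide's caveat requires, is left out for brevity.

```python
import numpy as np

def belief_transition(b, a, T, O):
    """tau(b, a, .): distribution over successor belief states.

    Returns a list of (b_next, P(b_next | b, a)) pairs, one per observation o
    with P(o | b, a) > 0, assuming distinct observations give distinct beliefs.
    """
    successors = []
    for o in range(O.shape[2]):
        unnormalized = O[:, a, o] * (b @ T[:, a, :])
        p_o = unnormalized.sum()          # P(o | b, a)
        if p_o > 0.0:
            successors.append((unnormalized / p_o, p_o))
    return successors
```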

Page 28: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

POMDP = Infinite-State MDP

A POMDP is an MDP with tuple $\langle B, A, \tau, \rho \rangle$:

– $B$: the set of belief states
– $A$: the finite set of actions (the same as in the original MDP)
– $\tau: B \times A \to \Pi(B)$: the state-transition function, $\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}_{t+1} = \mathbf{b}' \mid \mathbf{b}_t = \mathbf{b}, a_t = a)$
– $\rho: B \times A \to \mathbb{R}$: the reward function

What is the reward function?

Page 29: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reward Function

$$\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$$

where $R$ is the reward function of the original MDP.

Good news: it is linear.
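The belief-space reward is just an expectation of the underlying reward, so with the arrays assumed above it is a single matrix product; computing it for every action at once is a convenience of this sketch.

```python
import numpy as np

def belief_reward(b, R):
    """rho(b, a) = sum_s b(s) R(s, a), returned for all actions at once.

    b has shape (|S|,), R has shape (|S|, |A|); the result is linear in b.
    """
    return b @ R   # shape (|A|,)
```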

Page 30: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

Value Function of POMDP

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 31: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Function over Belief Space

Consider a 2-state POMDP.

[Figure: the value function $V(\mathbf{b})$ plotted over the belief space $0 \le b \le 1$.]

How can we obtain the value function in belief space? Can we use the table-based method?

Page 32: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Finding Optimal Policy

POMDP = Infinite-State MDP. The general method for an MDP:
– determine the value function, then perform policy improvement.

Value functions:
– state value function
– action value function

Page 33: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Review: Value Iteration

It is based on the finite-horizon value function and finds $\pi_t^*$ on each iteration.

What is $\pi_1^*$?

Page 34: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_1^*$ and $V_1^*$

Immediate reward: $\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$

$$Q_1^a(\mathbf{b}) = \sum_{s \in S} b(s)\, R(s, a)$$

$$V_1^*(\mathbf{b}) = \max_a Q_1^a(\mathbf{b}), \qquad \pi_1^*(\mathbf{b}) = \arg\max_a Q_1^a(\mathbf{b})$$
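A minimal sketch of the horizon-1 quantities, reusing the assumed R array: Q_1 is one linear function of the belief per action, V_1* the largest of them, and pi_1* the index of the winning action.

```python
import numpy as np

def horizon_one(b, R):
    """Return (Q1, V1_star, pi1_star) for a belief b.

    Q1[a] = sum_s b(s) R(s, a);  V1_star = max_a Q1[a];  pi1_star = argmax_a Q1[a].
    """
    Q1 = b @ R
    return Q1, Q1.max(), int(Q1.argmax())

# Example (illustrative): horizon_one(np.array([0.4, 0.6]), R)
```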

Page 35: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_1^*$ and $V_1^*$ (cont.)

Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).

[Figure: $Q_1^{a_1}(\mathbf{b})$ and $Q_1^{a_2}(\mathbf{b})$ are two straight lines over the belief space $0 \le b \le 1$; $V_1^*$ is their upper envelope, and each belief is assigned the action whose line is on top.]

Page 36: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-1 Policy Trees

Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).

[Figure: $V_1^*$ over the belief space, with the partition $P_1$ into a region whose horizon-1 policy tree is $a_2$ and a region whose tree is $a_1$; together they give $\pi_1^*$.]

Page 37: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-1 Policy Trees (cont.)

$V_1^*$ is piecewise linear and convex (PWLC).

Page 38: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_1^*$ and $V_1^*$ (cont.)

How about a 3-state POMDP and more?

[Figure: for a 3-state POMDP the belief space is the triangle spanned by the corner beliefs for $s_1$, $s_2$, and $s_3$; $V_1^*$ is the upper surface of the planes $Q_1^a$ over this triangle.]

It is PWLC. What is the policy?

Page 39: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_1^*$ and $V_1^*$ (cont.)

(The same construction applies to 3-state POMDPs and beyond.) What is the policy?

Page 40: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The PWLC

A piecewise linear function consists of linear, or hyperplane, segments.

– Linear function: $y = \sum_{i=0}^{N} \alpha_i x_i$, with $x_0 = 1$
– $k$th linear segment: $y = \sum_{i=0}^{N} \alpha_i^k x_i$
– the $\alpha$-vector: $\boldsymbol{\alpha}^k = [\alpha_0^k, \alpha_1^k, \ldots, \alpha_N^k]^T$
– each segment can be represented as $(\boldsymbol{\alpha}^k)^T \mathbf{x}$

Page 41: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The PWLC (cont.)

$$f(\mathbf{x}) = \max_k\, (\boldsymbol{\alpha}^k)^T \mathbf{x} \quad \text{is PWLC.}$$
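A tiny sketch of evaluating such a function: store the α-vectors as rows of a matrix and take the maximum of their dot products with x. The numbers in the check at the end are invented.

```python
import numpy as np

def pwlc_value(alphas, x):
    """Evaluate f(x) = max_k alpha_k^T x, the upper envelope of linear segments.

    alphas has shape (K, N) with one alpha-vector per row; x has shape (N,).
    """
    return (alphas @ x).max()

alphas = np.array([[1.0, 0.0],
                   [0.2, 0.8]])
b = np.array([0.3, 0.7])
print(pwlc_value(alphas, b))   # max(0.30, 0.62) = 0.62
```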

Page 42: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_t^*$ and $V_t^*$

$$Q_t^a(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}') = \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o))$$

The first term is the immediate reward. In the second term, $V_{t-1}^*(SE(\mathbf{b}, a, o))$ is the value of observation $o$ after doing action $a$ in the current belief state $\mathbf{b}$, and $P(o \mid \mathbf{b}, a)$ is the probability of that observation; their product is written $V_{t-1}^{*,a,o}(\mathbf{b})$.

Page 43: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_t^*$ and $V_t^*$ (cont.)

The immediate-reward term $\rho(\mathbf{b}, a)$ is PWLC (in fact linear). Is $V_{t-1}^{*,a,o}(\mathbf{b})$ PWLC as well? Yes, it is, but the proof is deferred.
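A sketch of the one-step lookahead Q_t^a(b), again using the assumed arrays; V_prev stands for any evaluator of V_{t-1}* (for instance a PWLC envelope such as pwlc_value above), and the discount factor gamma is a parameter added here.

```python
import numpy as np

def q_backup(b, a, T, R, O, V_prev, gamma=0.95):
    """Q_t^a(b) = rho(b, a) + gamma * sum_o P(o | b, a) * V_{t-1}*(SE(b, a, o)).

    V_prev is a callable mapping a belief vector to a value.
    """
    q = float(b @ R[:, a])                         # immediate reward rho(b, a)
    for o in range(O.shape[2]):
        unnormalized = O[:, a, o] * (b @ T[:, a, :])
        p_o = unnormalized.sum()                   # P(o | b, a)
        if p_o > 0.0:
            q += gamma * p_o * V_prev(unnormalized / p_o)
    return q

# Example: Q_2^a(b) with V_1*(b) = max_a rho(b, a)
# q_backup(np.array([0.4, 0.6]), a=0, T=T, R=R, O=O, V_prev=lambda bb: (bb @ R).max())
```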

Page 44: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$

$$Q_2^a(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_1^*(\mathbf{b}') = \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o)) = \rho(\mathbf{b}, a) + \gamma \sum_o V_1^{*,a,o}(\mathbf{b})$$

$$V_2^*(\mathbf{b}) = \max_a Q_2^a(\mathbf{b}), \qquad \pi_2^*(\mathbf{b}) = \arg\max_a Q_2^a(\mathbf{b})$$

Page 45: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.): Compute $Q_2^{a_1}$

[Figure: from a belief $\mathbf{b}$, taking $a_1$ earns the immediate reward $\rho(\mathbf{b}, a_1)$; depending on the observation $o_1$, $o_2$, or $o_3$, the belief moves to $\mathbf{b}' = SE(\mathbf{b}, a_1, o)$, which is evaluated with $V_1^*$ (the upper envelope of the lines for $a_1$ and $a_2$).]

Page 46: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

What action will you take if the observation is $o_i$ after $a_1$ is taken?

Page 47: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

Consider an individual observation $o$ after action $a$ is taken. Define

$$V_1^{*,a,o}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V_1^*(\mathbf{b}') = P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$$

Then

$$Q_2^a(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_o V_1^{*,a,o}(\mathbf{b})$$

Page 48: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

[Figure: the transformed value function $V_1^{*,a,o}(\mathbf{b})$ next to the original $V_1^*(\mathbf{b})$ and the immediate reward $\rho(\mathbf{b}, a)$, each PWLC over the belief space $0 \le b \le 1$.]

Page 49: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

[Figure: the immediate reward $\rho(\mathbf{b}, a_1)$ and the three transformed value functions $V_1^{*,a_1,o_1}(\mathbf{b})$, $V_1^{*,a_1,o_2}(\mathbf{b})$, $V_1^{*,a_1,o_3}(\mathbf{b})$ over the belief space.]

Page 50: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

[Figure: summing the immediate reward and the three transformed value functions for $o_1$, $o_2$, $o_3$ gives $Q_2^{a_1}$, again PWLC over the belief space.]

Page 51: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-2 Tree for Action 1

[Figure: $Q_2^{a_1}$ over the belief space, partitioned by which horizon-1 action is best after each of the observations $o_1$, $o_2$, $o_3$; each region is labelled by the corresponding triple of horizon-1 actions (e.g. $(a_1, a_1, a_2)$) and corresponds to a policy tree with root $a_1$ and one horizon-1 subtree from ($P_1$, $\pi_1^*$) per observation.]

Page 52: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-2 Tree for Action 1 (cont.)

[Figure: $Q_2^{a_1}$ and $Q_2^{a_2}$ over the belief space, each partitioned by the triple of horizon-1 actions chosen for observations $o_1$, $o_2$, $o_3$.]

Page 53: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

[Figure: $Q_2^{a_1}$ and $Q_2^{a_2}$ plotted together over the belief space.]

Page 54: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_2^*$ and $V_2^*$ (cont.)

[Figure: $V_2^*$ is the upper envelope of $Q_2^{a_1}$ and $Q_2^{a_2}$; the belief space splits into regions whose horizon-2 policy starts with $a_1$ and regions whose policy starts with $a_2$.]

Page 55: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-2 Policy Tree

[Figure: the horizon-2 partition $P_2$ with policy $\pi_2^*$; each region's tree has a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-1 tree from ($P_1$, $\pi_1^*$).]

Can you figure out how to determine the value function for horizon 3 from the above discussion?

Page 56: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_3^*$ and $V_3^*$

[Figure: from a belief $\mathbf{b}$, each action $a_i$ combines the transformed value functions $V_2^{*,a_i,o_1}(\mathbf{b})$, $V_2^{*,a_i,o_2}(\mathbf{b})$, $V_2^{*,a_i,o_3}(\mathbf{b})$ (built from $V_2^*$) into $Q_3^{a_i}(\mathbf{b})$; $V_3^*(\mathbf{b})$ is the maximum of $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$.]

Page 57: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_3^*$ and $V_3^*$ (cont.)

[Figure: $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$ over the belief space, each partitioned by the observations $o_1$, $o_2$, $o_3$.]

Page 58: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

The $\pi_3^*$ and $V_3^*$ (cont.)

[Figure: $V_3^*$ as the upper envelope of $Q_3^{a_1}$ and $Q_3^{a_2}$.]

How about $\pi_t^*$ and $V_t^*$ for general $t$?

Page 59: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Horizon-3 Policy Tree

[Figure: the horizon-3 partition $P_3$ with policy $\pi_3^*$; each region's tree has a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-2 tree from ($P_2$, $\pi_2^*$), whose subtrees are horizon-1 trees from ($P_1$, $\pi_1^*$).]

Page 60: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Reinforcement Learning

Partially Observable Markov Decision Processes

(POMDP)

The PWLC Property of Value Function

Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

Page 61: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Function for POMDP

$$Q_t^a(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}')$$

$$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$$

$$V_1^*(\mathbf{b}) = \max_a \rho(\mathbf{b}, a) = \max_a \sum_i b(s_i)\, R(s_i, a)$$

Page 62: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Function for POMDP (cont.)

Let $\mathbf{r}_a = \big[ R(s_1, a), R(s_2, a), \ldots \big]^T$. Then

$$V_1^*(\mathbf{b}) = \max_a\, \mathbf{r}_a^T \mathbf{b}$$

Page 63: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Value Function for POMDP (cont.)

Let $\boldsymbol{\alpha}^{k,1} = \mathbf{r}_{a_k}$. Then

$$V_1^*(\mathbf{b}) = \max_k\, (\boldsymbol{\alpha}^{k,1})^T \mathbf{b}$$

Page 64: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Theorem

$V_t^*(\mathbf{b})$ is PWLC.

Page 65: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Proof: $V_t^*(\mathbf{b})$ is PWLC.

By induction:
– We already know that $V_1^*(\mathbf{b})$ is PWLC.
– Assume that $V_{t-1}^*(\mathbf{b})$ is also PWLC.
– We then show that $V_t^*(\mathbf{b})$ must be PWLC.

Page 66: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Proof (cont.)

$$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$$

From the assumption, we have

$$V_{t-1}^*(SE(\mathbf{b}, a, o)) = \max_k\, (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$$

Therefore

$$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a) \max_k\, (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o) \right]$$

Page 67: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Proof (cont.)

Let $l(a, o, \mathbf{b}) = \arg\max_k\, (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$. Then

$$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_o P(o \mid \mathbf{b}, a)\, (\boldsymbol{\alpha}^{l(a,o,\mathbf{b}),t-1})^T SE(\mathbf{b}, a, o) \right]$$

Since $P(o \mid \mathbf{b}, a)\, SE(\mathbf{b}, a, o) = \tilde{T}^{a,o}\, \mathbf{b}$,

$$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_o (\boldsymbol{\alpha}^{l(a,o,\mathbf{b}),t-1})^T\, \tilde{T}^{a,o}\, \mathbf{b} \right]$$

Page 68: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Proof

, 1arg ma ( , ,( , , )x) Tk t

ko aa SE o b bαLet

*, 1 (( ) max ( , ) ( | , ) m )a , ,x T

t k ta k

o

V a SEo aP oa

b b α bb

*, 1 (( ) max ( , ) ( | , ) m )a , ,x T

t k ta k

o

V a SEo aP oa

b b α bb

*, 1( , , ) ( , , )( ) max ( , ) ( | , ) a o

Tt t

ao

SV o a oa EP a

b bb b b α

, 1( , ,, )max ( , ) Tt a o

aa

ooa

bb bα T

, 1, ) ,( ,max T Ta t aa

ao

oo

b T br α

Page 69: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)

Proof

( , , ,)*

, 1( ) max T Tt a t

ao

aa ooV

b bb r Tα

Let ( , , ), , 1 ,ii ia o a oT T

k t a to

b Tα r α

*,( ) max T

t k tk

V b α b

*( )tV b is PWLC.
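The construction in the proof suggests a direct, if exponential, way to compute the α-vectors of V_t* from those of V_{t-1}*. The sketch below enumerates every candidate r_a + γ Σ_o (T̃^{a,o})^T α^{k_o}; it assumes the T, R, O arrays from earlier, adds a gamma parameter, and does no pruning of dominated vectors, which practical solvers would add.

```python
import itertools
import numpy as np

def exact_backup(alphas_prev, T, R, O, gamma=0.95):
    """Enumerate the alpha-vectors of V_t* from those of V_{t-1}*.

    Each candidate is r_a + gamma * sum_o (T_ao)^T alpha_{k_o}, where
    T_ao[s', s] = O(s', a, o) * T(s, a, s') maps b to the unnormalized
    updated belief.  V_t*(b) is then the max over the returned rows of alpha @ b.
    """
    n_states, n_actions = R.shape
    n_obs = O.shape[2]
    new_alphas = []
    for a in range(n_actions):
        r_a = R[:, a]
        # gamma * (T_ao)^T alpha_k for every observation o and previous vector k:
        # ((T_ao)^T alpha)[s] = sum_s' O(s', a, o) T(s, a, s') alpha(s')
        proj = [[gamma * (T[:, a, :] @ (O[:, a, o] * alpha_k)) for alpha_k in alphas_prev]
                for o in range(n_obs)]
        # one candidate per way of picking a previous vector for each observation
        for choice in itertools.product(range(len(alphas_prev)), repeat=n_obs):
            new_alphas.append(r_a + sum(proj[o][choice[o]] for o in range(n_obs)))
    return np.array(new_alphas)

# Starting from the horizon-1 vectors alpha^{k,1} = r_{a_k} (the columns of R):
# alphas1 = R.T
# alphas2 = exact_backup(alphas1, T, R, O)
# V2_star = lambda b: (alphas2 @ b).max()
```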