Inverse Reinforcement Learning in Partially Observable Environments. Jaedeug Choi and Kee-Eung Kim, Korea Advanced Institute of Science and Technology. JMLR, January 2011.


TRANSCRIPT

Page 1: Inverse Reinforcement Learning in Partially Observable Environments

Inverse Reinforcement Learning in Partially Observable Environments. Jaedeug Choi and Kee-Eung Kim, Korea Advanced Institute of Science and Technology.

JMLR Jan, 2011

Page 2: Inverse Reinforcement Learning in Partially Observable Environments

2

Basics

Reinforcement Learning (RL)
Markov Decision Process (MDP)

Page 3: Inverse Reinforcement Learning in Partially Observable Environments

3

Reinforcement Learning

Actions

Reward

Internal State

Observation

Page 4: Inverse Reinforcement Learning in Partially Observable Environments

4

Actions

Reward

Internal State

Observation

Inverse Reinforcement Learning

Page 5: Inverse Reinforcement Learning in Partially Observable Environments

5

Why the reward function?

Solves the more natural problem.

The most transferable representation of the agent's behaviour!

Page 6: Inverse Reinforcement Learning in Partially Observable Environments

6

Example 1

Reward

Page 7: Inverse Reinforcement Learning in Partially Observable Environments

7

Example 2

Page 8: Inverse Reinforcement Learning in Partially Observable Environments

8

Agent

Name: Agent

Role: Decision making

Property: Principle of rationality

Page 9: Inverse Reinforcement Learning in Partially Observable Environments

9

Environment

Markov Decision Process (MDP)

Partially Observable Markov Decision Process (POMDP)

Page 10: Inverse Reinforcement Learning in Partially Observable Environments

10

MDP

Sequential decision-making problem. States are directly perceived.

Page 11: Inverse Reinforcement Learning in Partially Observable Environments

11

POMDP

Sequential decision-making problem. States are perceived through some noisy observation. ("Seems like I'm near a wall!")

Concept of belief

Page 12: Inverse Reinforcement Learning in Partially Observable Environments

12

Policy

Explicit policy

Trajectory

Page 13: Inverse Reinforcement Learning in Partially Observable Environments

13

IRL for MDP\R

Policies

Trajectories

Linear approximation

QCP

Projection method

Apprenticeship learning

Page 14: Inverse Reinforcement Learning in Partially Observable Environments

14

Using Policies

Ng and Russell, 2000

Any policy deviating from expert’s policy should not yield a higher value.
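
As a rough sketch of that condition (my own variable names; the simplified state-reward form from Ng and Russell's paper, with the expert's action renamed to a single action a_E in every state):

import numpy as np

def expert_not_dominated(P, a_E, R, gamma):
    # Ng and Russell's (2000) condition:
    # (P_aE - P_a) (I - gamma * P_aE)^(-1) R >= 0 for every other action a.
    # P: dict mapping action -> |S|x|S| transition matrix;
    # R: length-|S| reward vector; gamma: discount factor in [0, 1).
    n = P[a_E].shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P[a_E], R)   # expert's value function
    return all(np.all((P[a_E] - P_a) @ V >= -1e-9)
               for a, P_a in P.items() if a != a_E)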

Page 15: Inverse Reinforcement Learning in Partially Observable Environments

15

Using Sample Trajectories

Linear approximation for the reward function:

R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a)

where α ∈ [-1,1]^d and φ : S×A → [0,1]^d are basis functions.
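
A minimal sketch of this linear parameterization (the feature functions and weights below are placeholders, not taken from the paper):

import numpy as np

def linear_reward(alpha, phi, s, a):
    # R(s, a) = alpha^T phi(s, a), with alpha in [-1, 1]^d and
    # phi : S x A -> [0, 1]^d a vector of basis functions.
    return float(np.dot(alpha, phi(s, a)))

# Two hypothetical indicator features, just for illustration.
phi = lambda s, a: np.array([1.0 if s == "goal" else 0.0,
                             1.0 if a == "stay" else 0.0])
alpha = np.array([1.0, -0.5])
print(linear_reward(alpha, phi, "goal", "stay"))   # prints 0.5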

Page 16: Inverse Reinforcement Learning in Partially Observable Environments

16

Using Linear Programming

Page 17: Inverse Reinforcement Learning in Partially Observable Environments

17

Apprenticeship Learning

Learn a policy from the expert's demonstration.
Does not compute the exact reward function.

Page 18: Inverse Reinforcement Learning in Partially Observable Environments

18

Using QCP

Approximated using the Projection method!
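
The projection method referred to here comes from apprenticeship learning (Abbeel and Ng, 2004). A sketch of one update step, under my own variable names (mu_E, mu_bar, mu_new are feature-expectation vectors of the expert, the current estimate, and the newest policy):

import numpy as np

def projection_step(mu_E, mu_bar, mu_new):
    # Project the expert's feature expectations mu_E onto the line through
    # the previous estimate mu_bar and the newest policy's feature
    # expectations mu_new, then recompute the reward weights w.
    d = mu_new - mu_bar
    mu_bar = mu_bar + (np.dot(d, mu_E - mu_bar) / np.dot(d, d)) * d
    w = mu_E - mu_bar          # new reward weights, R = w^T phi
    return mu_bar, w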

Page 19: Inverse Reinforcement Learning in Partially Observable Environments

19

IRL in POMDP

Ill-posed problem: existence, uniqueness, stability (e.g., R = 0 always satisfies the constraints).

Computationally intractable: exponential increase in problem size!

Page 20: Inverse Reinforcement Learning in Partially Observable Environments

20

IRL for POMDP\R (extending IRL for MDP\R)

From policies:
Q functions
Howard's policy improvement theorem
Witness theorem

From trajectories:
MMV method
MMFE method
PRJ method

Page 21: Inverse Reinforcement Learning in Partially Observable Environments

21

Comparing Q functions

Constraint: deviating one step from the expert's controller (choosing a different action and observation strategy at a node) should not yield a higher Q value.

Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert, and over all |N| nodes there are |N||A||N|^|Z| ways to deviate, so the number of constraints grows exponentially!
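
For a concrete sense of that growth (the numbers below are illustrative, not from the paper):

A, Z, N = 4, 2, 10            # illustrative sizes of |A|, |Z|, |N|
per_node = A * N ** Z         # |A| * |N|^|Z| one-step deviations from one node
total = N * per_node          # |N| * |A| * |N|^|Z| over the whole controller
print(per_node, total)        # 400 4000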

Page 22: Inverse Reinforcement Learning in Partially Observable Environments

22

DP Update Based Approach

Comes from the generalized Howard's policy improvement theorem (Hansen, 1998):

If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state.

Page 23: Inverse Reinforcement Learning in Partially Observable Environments

24

Comparison

Page 24: Inverse Reinforcement Learning in Partially Observable Environments

25

IRL for POMDP\R (extending IRL for MDP\R)

From policies:
Q functions
Howard's policy improvement theorem
Witness theorem

From trajectories:
MMV method
MMFE method
PRJ method

Page 25: Inverse Reinforcement Learning in Partially Observable Environments

MMV Method

26

Page 26: Inverse Reinforcement Learning in Partially Observable Environments

MMFE Method

27

Approximated using the Projection (PRJ) method!

Page 27: Inverse Reinforcement Learning in Partially Observable Environments

28

Experimental Results

Tiger
1D Maze
5 x 5 Grid World
Heaven/Hell
RockSample

Page 28: Inverse Reinforcement Learning in Partially Observable Environments

29

Illustration

Page 29: Inverse Reinforcement Learning in Partially Observable Environments

30

Characteristics

Page 30: Inverse Reinforcement Learning in Partially Observable Environments

31

Results from Policy

Page 31: Inverse Reinforcement Learning in Partially Observable Environments

32

Results from Trajectories

Page 32: Inverse Reinforcement Learning in Partially Observable Environments

33

Questions ???

Page 33: Inverse Reinforcement Learning in Partially Observable Environments

34

Backup slides !

Page 34: Inverse Reinforcement Learning in Partially Observable Environments

35

Inverse Reinforcement Learning

Given:
measurements of an agent's behaviour over time, in a variety of circumstances,
measurements of the sensory inputs to the agent,
a model of the physical environment (including the agent's body).

Determine:
the reward function that the agent is optimizing.

Russell (1998)

Page 35: Inverse Reinforcement Learning in Partially Observable Environments

36

Partially Observable Environment

Mathematical framework for single-agent planning under uncertainty.

Agent cannot directly observe the underlying states.

Example: Study global warming from your grandfather’s diary !

Page 36: Inverse Reinforcement Learning in Partially Observable Environments

37

Advantages of IRL

Natural way to examine animal and human behaviors.

Reward function – most transferable representation of agent’s behavior.

Page 37: Inverse Reinforcement Learning in Partially Observable Environments

38

MDP

Models a sequential decision-making problem.
Five-tuple: <S, A, T, R, γ>

S – finite set of states
A – finite set of actions
T – state transition function, T : S×A → Π(S)
R – reward function, R : S×A → ℝ
γ – discount factor in [0, 1)

Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
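
A minimal sketch of this Bellman equation for Q, assuming array layouts R[s, a] and T[s, a, s']:

import numpy as np

def q_from_v(R, T, V, gamma):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
    return R + gamma * np.einsum("saq,q->sa", T, V)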

Page 38: Inverse Reinforcement Learning in Partially Observable Environments

39

POMDP

Partially observable environment.
Eight-tuple: <S, A, Z, T, O, R, b₀, γ>

Z – finite set of observations
O – observation function, O : S×A → Π(Z)
b₀ – initial state distribution, b₀(s)

Belief b – b(s) is the probability that the state is s at the current time step.
(The belief is a sufficient summary of the action-observation history, introduced to reduce the complexity of reasoning over full histories.)
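
The belief itself is maintained by a Bayes update after each action and observation. A sketch, assuming array layouts T[s, a, s'] and O[s', a, z] and integer-indexed actions and observations:

import numpy as np

def belief_update(b, a, z, T, O):
    # b'(s') is proportional to O(s', a, z) * sum_s T(s, a, s') * b(s)
    b_next = O[:, a, z] * (b @ T[:, a, :])
    return b_next / b_next.sum()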

Page 39: Inverse Reinforcement Learning in Partially Observable Environments

40

Finite State Controller (FSC)

A policy in a POMDP is represented using an FSC, a directed graph <N, E>:
each node n ∈ N is associated with an action a ∈ A,
each node has one outgoing edge e ∈ E per observation z ∈ Z.

π = <ψ, η>, where ψ is the action strategy and η is the observation strategy.

Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
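
A sketch of that last identity, assuming Q_dev[n] already holds the per-state values Q(<n, s>, <a, os>) for one fixed deviation <a, os> (variable names are mine):

import numpy as np

def q_at_belief(Q_dev, n, b):
    # Q(<n, b>, <a, os>) = sum_s b(s) * Q(<n, s>, <a, os>)
    # Q_dev: |N| x |S| array, b: length-|S| belief vector.
    return float(b @ Q_dev[n])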

Page 40: Inverse Reinforcement Learning in Partially Observable Environments

41

Using Projection Method

Page 41: Inverse Reinforcement Learning in Partially Observable Environments

PRJ Method

42