Inverse Reinforcement Learning in Partially Observable Environments. Jaedeug Choi and Kee-Eung Kim, Korea Advanced Institute of Science and Technology. JMLR, January 2011.


TRANSCRIPT

Page 1: Inverse Reinforcement Learning in Partially Observable Environments

Inverse Reinforcement Learning in Partially Observable Environments. Jaedeug Choi and Kee-Eung Kim, Korea Advanced Institute of Science and Technology.

JMLR Jan, 2011

Page 2: Inverse Reinforcement Learning in Partially Observable Environments

2

Basics

Reinforcement Learning (RL)
Markov Decision Process (MDP)

Page 3: Inverse Reinforcement Learning in Partially Observable Environments

3

Reinforcement Learning

Actions

Reward

Internal State

Observation

Page 4: Inverse Reinforcement Learning in Partially Observable Environments

4

Actions

Reward

Internal State

Observation

Inverse Reinforcement Learning

Page 5: Inverse Reinforcement Learning in Partially Observable Environments

5

Why the reward function?

Solves the more natural problem.

The most transferable representation of the agent's behaviour!

Page 6: Inverse Reinforcement Learning in Partially Observable Environments

6

Example 1

Reward

Page 7: Inverse Reinforcement Learning in Partially Observable Environments

7

Example 2

Page 8: Inverse Reinforcement Learning in Partially Observable Environments

8

Agent

Name: Agent

Role: Decision making

Property: Principle of rationality

Page 9: Inverse Reinforcement Learning in Partially Observable Environments

9

Environment

Markov Decision Process (MDP)

Partially Observable Markov Decision Process (POMDP)

Page 10: Inverse Reinforcement Learning in Partially Observable Environments

10

MDP

Sequential decision-making problem. States are directly perceived.

Page 11: Inverse Reinforcement Learning in Partially Observable Environments

11

POMDP

Sequential decision-making problem. States are perceived through some noisy observation. ("Seems like I'm near a wall!")

Concept of belief

Page 12: Inverse Reinforcement Learning in Partially Observable Environments

12

Policy

Explicit policy

Trajectory

Page 13: Inverse Reinforcement Learning in Partially Observable Environments

13

IRL for MDP\R

Policies

Trajectories

Linear approximation

QCP

Projection method

Apprenticeship learning

Page 14: Inverse Reinforcement Learning in Partially Observable Environments

14

Using Policies

Ng and Russell, 2000

Any policy deviating from expert’s policy should not yield a higher value.
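
As a rough sketch of that condition (my own variable names; the simplified state-reward form from Ng and Russell's paper, with the expert's action renamed to a single action a_E in every state):

import numpy as np

def expert_not_dominated(P, a_E, R, gamma):
    # Ng and Russell's (2000) condition:
    # (P_aE - P_a) (I - gamma * P_aE)^(-1) R >= 0 for every other action a.
    # P: dict mapping action -> |S|x|S| transition matrix;
    # R: length-|S| reward vector; gamma: discount factor in [0, 1).
    n = P[a_E].shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P[a_E], R)   # expert's value function
    return all(np.all((P[a_E] - P_a) @ V >= -1e-9)
               for a, P_a in P.items() if a != a_E)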

Page 15: Inverse Reinforcement Learning in Partially Observable Environments

15

Using Sample Trajectories

Linear approximation for the reward function:

R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a)

where α ∈ [-1,1]^d and φ : S×A → [0,1]^d are basis functions.
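
A minimal sketch of this linear parameterization (the feature functions and weights below are placeholders, not taken from the paper):

import numpy as np

def linear_reward(alpha, phi, s, a):
    # R(s, a) = alpha^T phi(s, a), with alpha in [-1, 1]^d and
    # phi : S x A -> [0, 1]^d a vector of basis functions.
    return float(np.dot(alpha, phi(s, a)))

# Two hypothetical indicator features, just for illustration.
phi = lambda s, a: np.array([1.0 if s == "goal" else 0.0,
                             1.0 if a == "stay" else 0.0])
alpha = np.array([1.0, -0.5])
print(linear_reward(alpha, phi, "goal", "stay"))   # prints 0.5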

Page 16: Inverse Reinforcement Learning in Partially Observable Environments

16

Using Linear Programming

Page 17: Inverse Reinforcement Learning in Partially Observable Environments

17

Apprenticeship Learning

Learn a policy from the expert's demonstration.
Does not compute the exact reward function.

Page 18: Inverse Reinforcement Learning in Partially Observable Environments

18

Using QCP

Approximated using the Projection method!
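
The projection method referred to here comes from apprenticeship learning (Abbeel and Ng, 2004). A sketch of one update step, under my own variable names (mu_E, mu_bar, mu_new are feature-expectation vectors of the expert, the current estimate, and the newest policy):

import numpy as np

def projection_step(mu_E, mu_bar, mu_new):
    # Project the expert's feature expectations mu_E onto the line through
    # the previous estimate mu_bar and the newest policy's feature
    # expectations mu_new, then recompute the reward weights w.
    d = mu_new - mu_bar
    mu_bar = mu_bar + (np.dot(d, mu_E - mu_bar) / np.dot(d, d)) * d
    w = mu_E - mu_bar          # new reward weights, R = w^T phi
    return mu_bar, w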

Page 19: Inverse Reinforcement Learning in Partially Observable Environments

19

IRL in POMDP

Ill-posed problem: existence, uniqueness, stability (e.g., R = 0 always satisfies the constraints).

Computationally intractable: exponential increase in problem size!

Page 20: Inverse Reinforcement Learning in Partially Observable Environments

20

IRL for POMDP\R (extending IRL for MDP\R)

From policies:
Q functions
Howard's policy improvement theorem
Witness theorem

From trajectories:
MMV method
MMFE method
PRJ method

Page 21: Inverse Reinforcement Learning in Partially Observable Environments

21

Comparing Q functions

Constraint: deviating one step from the expert's controller (choosing a different action and observation strategy at a node) should not yield a higher Q value.

Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert, and over all |N| nodes there are |N||A||N|^|Z| ways to deviate, so the number of constraints grows exponentially!
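
For a concrete sense of that growth (the numbers below are illustrative, not from the paper):

A, Z, N = 4, 2, 10            # illustrative sizes of |A|, |Z|, |N|
per_node = A * N ** Z         # |A| * |N|^|Z| one-step deviations from one node
total = N * per_node          # |N| * |A| * |N|^|Z| over the whole controller
print(per_node, total)        # 400 4000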

Page 22: Inverse Reinforcement Learning in Partially Observable Environments

22

DP Update Based Approach

Comes from the generalized Howard's policy improvement theorem (Hansen, 1998):

If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state.

Page 23: Inverse Reinforcement Learning in Partially Observable Environments

24

Comparison

Page 24: Inverse Reinforcement Learning in Partially Observable Environments

25

IRL for POMDP\R (extending IRL for MDP\R)

From policies:
Q functions
Howard's policy improvement theorem
Witness theorem

From trajectories:
MMV method
MMFE method
PRJ method

Page 25: Inverse Reinforcement Learning in Partially Observable Environments

MMV Method

26

Page 26: Inverse Reinforcement Learning in Partially Observable Environments

MMFE Method

27

Approximated using the Projection (PRJ) method!

Page 27: Inverse Reinforcement Learning in Partially Observable Environments

28

Experimental Results

Tiger
1D Maze
5 x 5 Grid World
Heaven/Hell
RockSample

Page 28: Inverse Reinforcement Learning in Partially Observable Environments

29

Illustration

Page 29: Inverse Reinforcement Learning in Partially Observable Environments

30

Characteristics

Page 30: Inverse Reinforcement Learning in Partially Observable Environments

31

Results from Policy

Page 31: Inverse Reinforcement Learning in Partially Observable Environments

32

Results from Trajectories

Page 32: Inverse Reinforcement Learning in Partially Observable Environments

33

Questions ???

Page 33: Inverse Reinforcement Learning in Partially Observable Environments

34

Backup slides !

Page 34: Inverse Reinforcement Learning in Partially Observable Environments

35

Inverse Reinforcement Learning

Given:
measurements of an agent's behaviour over time, in a variety of circumstances,
measurements of the sensory inputs to the agent,
a model of the physical environment (including the agent's body).

Determine:
the reward function that the agent is optimizing.

Russell (1998)

Page 35: Inverse Reinforcement Learning in Partially Observable Environments

36

Partially Observable Environment

Mathematical framework for single-agent planning under uncertainty.

Agent cannot directly observe the underlying states.

Example: Study global warming from your grandfather’s diary !

Page 36: Inverse Reinforcement Learning in Partially Observable Environments

37

Advantages of IRL

Natural way to examine animal and human behaviors.

Reward function – most transferable representation of agent’s behavior.

Page 37: Inverse Reinforcement Learning in Partially Observable Environments

38

MDP

Models a sequential decision-making problem.
Five-tuple: <S, A, T, R, γ>

S – finite set of states
A – finite set of actions
T – state transition function, T : S×A → Π(S)
R – reward function, R : S×A → ℝ
γ – discount factor in [0, 1)

Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
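
A minimal sketch of this Bellman equation for Q, assuming array layouts R[s, a] and T[s, a, s']:

import numpy as np

def q_from_v(R, T, V, gamma):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
    return R + gamma * np.einsum("saq,q->sa", T, V)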

Page 38: Inverse Reinforcement Learning in Partially Observable Environments

39

POMDP

Partially observable environment.
Eight-tuple: <S, A, Z, T, O, R, b₀, γ>

Z – finite set of observations
O – observation function, O : S×A → Π(Z)
b₀ – initial state distribution, b₀(s)

Belief b – b(s) is the probability that the state is s at the current time step.
(The belief is a sufficient summary of the action-observation history, introduced to reduce the complexity of reasoning over full histories.)
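
The belief itself is maintained by a Bayes update after each action and observation. A sketch, assuming array layouts T[s, a, s'] and O[s', a, z] and integer-indexed actions and observations:

import numpy as np

def belief_update(b, a, z, T, O):
    # b'(s') is proportional to O(s', a, z) * sum_s T(s, a, s') * b(s)
    b_next = O[:, a, z] * (b @ T[:, a, :])
    return b_next / b_next.sum()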

Page 39: Inverse Reinforcement Learning in Partially Observable Environments

40

Finite State Controller (FSC)

A policy in a POMDP is represented using an FSC, a directed graph <N, E>:
each node n ∈ N is associated with an action a ∈ A,
each node has one outgoing edge e ∈ E per observation z ∈ Z.

π = <ψ, η>, where ψ is the action strategy and η is the observation strategy.

Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
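
A sketch of that last identity, assuming Q_dev[n] already holds the per-state values Q(<n, s>, <a, os>) for one fixed deviation <a, os> (variable names are mine):

import numpy as np

def q_at_belief(Q_dev, n, b):
    # Q(<n, b>, <a, os>) = sum_s b(s) * Q(<n, s>, <a, os>)
    # Q_dev: |N| x |S| array, b: length-|S| belief vector.
    return float(b @ Q_dev[n])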

Page 40: Inverse Reinforcement Learning in Partially Observable Environments

41

Using Projection Method

Page 41: Inverse Reinforcement Learning in Partially Observable Environments

PRJ Method

42