1. Algorithms for Inverse Reinforcement Learning 2. Apprenticeship learning via Inverse Reinforcement Learning


Page 1

1. Algorithms for Inverse Reinforcement Learning

2. Apprenticeship learning via Inverse Reinforcement Learning

Page 2

Algorithms for Inverse Reinforcement Learning

Andrew Ng and Stuart Russell

Page 3

Motivation

● Given: (1) measurements of an agent's behavior over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment.

● Determine: the reward function being optimized.

Page 4

Why?

● Reason #1: Computational models for animal and human learning.

● “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”

● Particularly true of multiattribute reward functions (e.g. Bee foraging: amount of nectar vs. flight time vs. risk from wind/predators)

Page 5

Why?

● Reason #2: Agent construction.

● "An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior."

● e.g. "Driving well"

● Apprenticeship learning: is recovering the expert's underlying reward function more "parsimonious" than learning the expert's policy?

Page 6

Possible applications in multi-agent systems

● In multi-agent adversarial games, learning the reward functions that guide opponents' actions, in order to devise strategies against them.

● example

● In mechanism design, learning each agent's reward function from histories in order to manipulate its actions.

● and more?

Page 7

Inverse Reinforcement Learning (1) – MDP Recap

• An MDP is represented as a tuple (S, A, {P_sa}, γ, R)

Note: R is bounded by R_max

• Value function for policy π:

  V^π(s_1) = E[ R(s_1) + γR(s_2) + γ^2 R(s_3) + … | π ]

• Q-function:

  Q^π(s, a) = R(s) + γ E_{s'~P_sa(·)}[ V^π(s') ]

Page 8

Inverse Reinforcement Learning (1) – MDP Recap

• Bellman Equation:

  V^π(s) = R(s) + γ Σ_{s'} P_{sπ(s)}(s') V^π(s')

  Q^π(s, a) = R(s) + γ Σ_{s'} P_{sa}(s') V^π(s')

• Bellman Optimality:

  π(s) = argmax_{a∈A} Q^π(s, a)
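
As a minimal sketch (not from the slides), the Bellman equation for a fixed policy can be solved exactly in matrix form, V^π = (I − γP_π)^{-1} R, where P_π stacks the rows P_{sπ(s)}; this identity is what the finite-state IRL derivation on the next slide builds on.

    import numpy as np

    def policy_evaluation(P_pi, R, gamma):
        """Solve the Bellman equation V = R + gamma * P_pi @ V in matrix form.

        P_pi : (N, N) transition matrix under the fixed policy pi
        R    : (N,) state reward vector
        """
        N = P_pi.shape[0]
        return np.linalg.solve(np.eye(N) - gamma * P_pi, R)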

Page 9

Inverse Reinforcement Learning (2) – Finite State Space

• Reward function solution set (a_1 ≡ π(s) is the optimal action). In vector notation:

  V^π = R + γP_{a1} V^π   ⟹   V^π = (I − γP_{a1})^{-1} R

• π(s) = argmax_{a∈A} Q^π(s, a) = a_1 for all s is equivalent to:

  P_{a1} V^π ≥ P_a V^π   for all a ∈ A\{a_1}

  ⟺  P_{a1}(I − γP_{a1})^{-1} R ≥ P_a(I − γP_{a1})^{-1} R   for all a ∈ A\{a_1}

  ⟺  (P_{a1} − P_a)(I − γP_{a1})^{-1} R ≥ 0   for all a ∈ A\{a_1}

  (the inequalities are componentwise)
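
A small illustrative check of this condition (my own sketch; P is assumed to be a list of per-action transition matrices): it tests whether a candidate reward vector R leaves a_1 optimal in every state.

    import numpy as np

    def a1_is_optimal(P, R, gamma, a1=0):
        """Check (P_a1 - P_a)(I - gamma*P_a1)^{-1} R >= 0 for every action a != a1.

        P : list of (N, N) transition matrices, one per action
        R : (N,) candidate reward vector
        """
        N = len(R)
        V = np.linalg.solve(np.eye(N) - gamma * P[a1], R)   # V^pi = (I - gamma*P_a1)^{-1} R
        return all(np.all(P[a1] @ V >= P[a] @ V - 1e-9)     # small tolerance for round-off
                   for a in range(len(P)) if a != a1)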

Page 10

Inverse Reinforcement Learning (2) – Finite State Space

There are many choices of R that satisfy the inequality (e.g. R = 0). Which one might be the best solution?

1. Make deviation from π as costly as possible, i.e. maximize

   Σ_{s∈S} ( Q^π(s, a_1) − max_{a∈A\{a_1}} Q^π(s, a) )

2. Make the reward function as simple as possible (the LP on the next slide adds a penalty −λ||R||_1).

Page 11

Inverse Reinforcement Learning (2) – Finite State Space

• Linear Programming Formulation:

  maximize   Σ_{i=1}^{N}  min_{a ∈ {a_2, …, a_k}}  { (P_{a1}(i) − P_a(i)) (I − γP_{a1})^{-1} R }  −  λ||R||_1

  s.t.   (P_{a1}(i) − P_a(i)) (I − γP_{a1})^{-1} R ≥ 0   for all a ∈ A\{a_1},  i = 1, …, N

         |R_i| ≤ R_max,   i = 1, …, N

  (P_a(i) denotes the i-th row of P_a.)

[Figure: given actions a_1, a_2, …, a_n, choose R so that the margin between the value under a_1 and the value under a_2 (the next-best action) is maximized.]
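
A rough sketch of how this LP could be set up with scipy.optimize.linprog (my own encoding, not from the slides): auxiliary variables t_i stand for the per-state minimum over actions, and u_i ≥ |R_i| linearizes the ||R||_1 penalty.

    import numpy as np
    from scipy.optimize import linprog

    def irl_finite_lp(P, gamma, a1=0, lam=1.0, r_max=1.0):
        """Recover a reward vector R under which action a1 is optimal in every state.

        P : list of (N, N) transition matrices, one per action
        LP variables are x = [R, t, u] with t_i the per-state margin and u_i >= |R_i|.
        """
        N = P[0].shape[0]
        inv = np.linalg.inv(np.eye(N) - gamma * P[a1])
        zeros, eye = np.zeros((N, N)), np.eye(N)

        A_ub, b_ub = [], []
        for a in range(len(P)):
            if a == a1:
                continue
            G = (P[a1] - P[a]) @ inv
            A_ub.append(np.hstack([-G, eye, zeros]))    # t <= G R  (per state)
            A_ub.append(np.hstack([-G, zeros, zeros]))  # G R >= 0
            b_ub.extend([np.zeros(N), np.zeros(N)])
        A_ub.append(np.hstack([eye, zeros, -eye]))      #  R <= u
        A_ub.append(np.hstack([-eye, zeros, -eye]))     # -R <= u
        b_ub.extend([np.zeros(N), np.zeros(N)])

        # maximize sum(t) - lam * sum(u)  ==  minimize -sum(t) + lam * sum(u)
        c = np.concatenate([np.zeros(N), -np.ones(N), lam * np.ones(N)])
        bounds = [(-r_max, r_max)] * N + [(None, None)] * N + [(0, r_max)] * N
        res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub), bounds=bounds)
        return res.x[:N]                                # the recovered reward vector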

Page 12

Inverse Reinforcement Learning (3) – Large State Space

• Linear approximation of the reward function (in the driving example, basis functions could be collision, staying in the right lane, etc.):

  R(s) = α_1 φ_1(s) + α_2 φ_2(s) + … + α_d φ_d(s)

• Let V_i^π be the value function of policy π when the reward is R = φ_i. By linearity,

  V^π = α_1 V_1^π + α_2 V_2^π + … + α_d V_d^π

• For R to make π(s) ≡ a_1 optimal:

  E_{s'~P_{s a_1}}[ V^π(s') ] ≥ E_{s'~P_{s a}}[ V^π(s') ]   for all s and all a ∈ A\{a_1}

Page 13

Inverse Reinforcement Learning (3) – Large State Spaces

• With an infinite or very large state space, it is usually not possible to check the constraint

  E_{s'~P_{s a_1}}[ V^π(s') ] ≥ E_{s'~P_{s a}}[ V^π(s') ]

  for all states:

• Choose a finite subset S_0 of all states.

• Linear programming formulation: find α_i that

  maximize   Σ_{s∈S_0}  min_{a ∈ {a_2, …, a_k}}  p( E_{s'~P_{s a_1}}[ V^π(s') ] − E_{s'~P_{s a}}[ V^π(s') ] )

  s.t.   |α_i| ≤ 1,   i = 1, …, d

  where p(x) = x if x ≥ 0 and p(x) = 2x otherwise (violations are penalized twice as heavily).
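
One way to linearize the p(·) objective for an off-the-shelf LP solver (my own sketch; F[a] is an assumed precomputed matrix of the expectations E_{s'~P_sa}[V_i^π(s')] for the states in S_0, using V^π = Σ_i α_i V_i^π): since p(x) = min(x, 2x), each summand can be replaced by an auxiliary variable z_s bounded above by both x and 2x.

    import numpy as np
    from scipy.optimize import linprog

    def irl_large_state_lp(F, a1=0):
        """LP over the reward weights alpha for the sampled-constraint formulation.

        F : list of (m, d) matrices, F[a][s, i] = E_{s'~P_sa}[ V_i^pi(s') ]
            for the m states in S_0 and the d basis rewards phi_i
            (these expectations are assumed to be precomputed).
        """
        m, d = F[a1].shape
        A_ub, b_ub = [], []
        for a in range(len(F)):
            if a == a1:
                continue
            G = F[a1] - F[a]                                 # x_{s,a} = (G @ alpha)_s
            A_ub.append(np.hstack([-G, np.eye(m)]))          # z_s <= x_{s,a}
            A_ub.append(np.hstack([-2 * G, np.eye(m)]))      # z_s <= 2 x_{s,a}
            b_ub.extend([np.zeros(m), np.zeros(m)])

        c = np.concatenate([np.zeros(d), -np.ones(m)])       # maximize sum(z)
        bounds = [(-1, 1)] * d + [(None, None)] * m
        res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub), bounds=bounds)
        return res.x[:d]                                     # recovered alpha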

Page 14

Inverse Reinforcement Learning (4) – IRL from Sample Trajectories

• If π is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper).

• Assume we start from a dummy state s_0 (whose next-state distribution is according to D).

• In the case that the reward is R = φ_i, estimate V_i^π from a trajectory with state sequence (s_0, s_1, s_2, …):

  V̂_i^π(s_0) = φ_i(s_0) + γφ_i(s_1) + γ^2 φ_i(s_2) + …   (averaged over the sampled trajectories)

• Then

  V̂^π(s_0) = α_1 V̂_1^π(s_0) + α_2 V̂_2^π(s_0) + … + α_d V̂_d^π(s_0)
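
A minimal sketch of this estimator (my own code; `phi` is an assumed feature function returning the vector (φ_1(s), …, φ_d(s))):

    import numpy as np

    def empirical_feature_values(trajectories, phi, gamma):
        """Estimate Vhat_i^pi(s0) = phi_i(s0) + gamma*phi_i(s1) + gamma^2*phi_i(s2) + ...
        for every basis reward phi_i, averaged over the sampled trajectories.

        trajectories : list of state sequences [s0, s1, s2, ...]
        phi          : function mapping a state to its (d,) feature vector
        """
        d = len(phi(trajectories[0][0]))
        V_hat = np.zeros(d)
        for traj in trajectories:
            V_hat += sum((gamma ** t) * np.asarray(phi(s)) for t, s in enumerate(traj))
        return V_hat / len(trajectories)     # one entry per basis function phi_i

V̂^π(s_0) is then just the dot product of α = (α_1, …, α_d) with the returned vector.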

Page 15

Inverse Reinforcement Learning (4) – IRL from Sample Trajectories

• Assume we have some set of policies {π_1, π_2, …, π_k}.

• Linear programming formulation: find α_j that

  maximize   Σ_{i=1}^{k}  p( V̂^{π*}(s_0) − V̂^{π_i}(s_0) )

  s.t.   |α_j| ≤ 1,   j = 1, …, d

• The above optimization gives a new reward R; we then compute the optimal policy π_{k+1} under R and add it to the set of policies.

• Reiterate.

Page 16

Discrete Gridworld Experiment

● 5x5 grid world.

● Agent starts in the bottom-left square.

● Reward of 1 in the upper-right square.

● Actions = N, W, S, E (30% chance the move is random instead; see the sketch below).
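
A small sketch of the transition model as described (my own code; the 30% noise is assumed to mean the chosen action is replaced by a uniformly random one):

    import numpy as np

    ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

    def step(state, action, size=5, noise=0.3, rng=np.random.default_rng()):
        """One step in the 5x5 gridworld: with probability `noise` the intended
        action is replaced by a uniformly random one; moves off the grid are clipped."""
        if rng.random() < noise:
            action = rng.choice(list(ACTIONS))
        dx, dy = ACTIONS[action]
        x = min(max(state[0] + dx, 0), size - 1)
        y = min(max(state[1] + dy, 0), size - 1)
        reward = 1.0 if (x, y) == (size - 1, size - 1) else 0.0   # upper-right square
        return (x, y), reward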

Page 17

Discrete Gridworld Results

Page 18

Mountain Car Experiment #1

● Car starts in the valley; the goal is at the top of the hill.

● Reward is -1 per "step" until the goal is reached.

● State = car's x-position & velocity (continuous!)

● Function approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions.

Page 19

Mountain Car Experiment #2

● Goal is at the bottom of the valley.

● Car starts... not sure. Top of the hill?

● Reward is 1 in the goal area, 0 elsewhere.

● γ = 0.99

● State = car's x-position & velocity (continuous!)

● Function approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions.

Page 20

Mountain Car Results

[Figures: results for experiment #1 and experiment #2]

Page 21

Continuous Gridworld Experiment

● State space is now the continuous grid [0,1] x [0,1].

● Actions: 0.2 movement in any direction + noise in the x and y coordinates of [-0.1, 0.1].

● Reward 1 in the region [0.8,1] x [0.8,1], 0 elsewhere.

● γ = 0.9

● Function approximation class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions (see the sketch after this list).

● m = 5000 trajectories of 30 steps each per policy.
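
A sketch of the kind of feature map described (my own code; the Gaussian width sigma is an assumption, since the slide does not give it):

    import numpy as np

    def gaussian_features(state, n=15, sigma=0.1):
        """Evaluate an n x n grid of 2-D Gaussian-shaped basis functions at `state`,
        an (x, y) pair in [0,1] x [0,1]; centers are evenly spaced over the unit square."""
        centers = np.linspace(0.0, 1.0, n)
        cx, cy = np.meshgrid(centers, centers)
        d2 = (state[0] - cx) ** 2 + (state[1] - cy) ** 2
        return np.exp(-d2 / (2 * sigma ** 2)).ravel()   # (n*n,) feature vector

The reward estimate is then R(s) = α · gaussian_features(s), with α found by the LP on the "Large State Spaces" slides.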

Page 22

Continuous Gridworld Results

3%-10% error when comparing fitted reward's optimal policy with the true optimal policy

However, there is no significant difference in the quality of the policies (measured using the true reward function).

Page 23

Apprenticeship Learning via Inverse Reinforcement Learning

Pieter Abbeel & Andrew Y. Ng

Page 24

Algorithm

● For t = 1,2,…

Inverse RL step: Estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.

RL step: Compute the optimal policy π_t for the estimated reward w.

Courtesy of Pieter Abbeel
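
A rough outline of this loop in code (my own sketch; `rl_solve`, `feature_expectations`, and `initial_policy` are hypothetical placeholders, and `irl_max_margin` is the max-margin step made concrete in the sketch after the "Algorithm: IRL step" slide):

    def apprenticeship_learning(mu_expert, rl_solve, feature_expectations,
                                initial_policy, n_iters=30, eps=1e-3):
        """mu_expert: expert feature expectations; rl_solve(w) returns an optimal
        policy for the reward R(s) = w . phi(s); feature_expectations(pi) estimates
        mu(pi). All three are hypothetical placeholders for this sketch."""
        policies = [initial_policy]
        mus = [feature_expectations(initial_policy)]
        for _ in range(n_iters):
            w, margin = irl_max_margin(mu_expert, mus)   # IRL step (sketched below)
            if margin <= eps:                            # expert's performance matched
                break
            pi_t = rl_solve(w)                           # RL step
            policies.append(pi_t)
            mus.append(feature_expectations(pi_t))
        return w, policies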

Page 25

Algorithm: IRL step

● Maximize γ over γ, w : ||w||_2 ≤ 1

● s.t.  V_w(π_E) ≥ V_w(π_i) + γ,   i = 1, …, t−1

● γ = margin of the expert's performance over the performance of previously found policies.

● V_w(π) = E[ Σ_t γ^t R(s_t) | π ] = E[ Σ_t γ^t w^T φ(s_t) | π ]

●            = w^T E[ Σ_t γ^t φ(s_t) | π ]

●            = w^T μ(π)

● μ(π) = E[ Σ_t γ^t φ(s_t) | π ] are the "feature expectations"

Courtesy of Pieter Abbeel
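
A concrete version of this max-margin step (my own sketch; the cvxpy modeling library is an assumption, not something the slides use — `m` plays the role of the margin γ above):

    import numpy as np
    import cvxpy as cp

    def irl_max_margin(mu_expert, mus):
        """Find w with ||w||_2 <= 1 maximizing the margin m such that
        w . mu_expert >= w . mu_i + m for every previously found policy i.

        mu_expert : (d,) expert feature expectations
        mus       : list of (d,) feature expectations of the previous policies
        """
        mu_expert = np.asarray(mu_expert)
        d = len(mu_expert)
        w, m = cp.Variable(d), cp.Variable()
        constraints = [cp.norm(w, 2) <= 1]
        constraints += [mu_expert @ w >= np.asarray(mu_i) @ w + m for mu_i in mus]
        cp.Problem(cp.Maximize(m), constraints).solve()
        return w.value, m.value

This is the same optimization the "IRL step as Support Vector Machine" slide interprets as finding a maximum-margin hyperplane separating μ(π_E) from the μ(π_i).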

Page 26

Feature Expectation Closeness and Performance

● If we can find a policy π such that

  ||μ(π_E) − μ(π)||_2 ≤ ε,

● then for any underlying reward R*(s) = w*^T φ(s) (with ||w*||_2 ≤ 1),

● we have that

  |V_w*(π_E) − V_w*(π)| = |w*^T μ(π_E) − w*^T μ(π)|
                        ≤ ||w*||_2 ||μ(π_E) − μ(π)||_2    (Cauchy–Schwarz)
                        ≤ ε.

Courtesy of Pieter Abbeel
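
A quick numeric sanity check of this bound (my own, with arbitrary random vectors): for any unit-norm w*, the value gap never exceeds the feature-expectation distance.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)              # ||w*||_2 = 1
    mu_E = rng.normal(size=d)
    mu_pi = mu_E + 0.05 * rng.normal(size=d)      # a policy with nearby feature expectations

    gap = abs(w_star @ mu_E - w_star @ mu_pi)     # |V_w*(pi_E) - V_w*(pi)|
    bound = np.linalg.norm(mu_E - mu_pi)          # ||mu(pi_E) - mu(pi)||_2
    assert gap <= bound + 1e-12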

Page 27

IRL step as Support Vector Machine

Maximum-margin hyperplane separating two sets of points:

  μ(π_E)

  { μ(π_i) }

|w*^T μ(π_E) − w*^T μ(π)| = |V_w*(π_E) − V_w*(π)| = the maximal difference between the expert policy's value function and the value function of the second-best policy.

Page 28

[Figure: the iteration in feature-expectation space. The points μ(π_0), μ(π_1), μ(π_2), … move toward the expert's μ(π_E); each weight vector w^(1), w^(2), w^(3) is the max-margin direction separating μ(π_E) from the previously found points, with U_w(π) = w^T μ(π).]

Courtesy of Pieter Abbeel

Page 29

Gridworld Experiment

● 128 x 128 grid world divided into 64 regions, each of size 16 x 16 (“macrocells”).

● A small number of macrocells have positive rewards.

● For each macrocell i, there is one feature Φ_i(s) indicating whether state s is in macrocell i (see the sketch after this list).

● The algorithm was also run on the subset of features Φ_i(s) that correspond to non-zero rewards.
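
A minimal sketch of such indicator features (my own code; the macrocell indexing order is an arbitrary choice):

    import numpy as np

    def macrocell_features(state, grid=128, cell=16):
        """Indicator features for the 128 x 128 gridworld: phi_i(state) = 1 iff
        state = (x, y) lies in macrocell i, for the 64 macrocells of size 16 x 16."""
        x, y = state
        n = grid // cell                            # 8 macrocells per side
        idx = (y // cell) * n + (x // cell)         # index of the macrocell containing (x, y)
        phi = np.zeros(n * n)
        phi[idx] = 1.0
        return phi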

Page 30

Gridworld Results

Performance vs. # Trajectories

Distance to expert vs. # Iterations

Page 31

Car Driving Experiment

● No explicit reward function at all!

● Expert demonstrates the desired policy via 2 min. of driving time on a simulator (1200 data points).

● 5 different "driver types" tried.

● Features: which lane the car is in, distance to the closest car in the current lane.

● Algorithm run for 30 iterations; the policy was hand-picked.

● Movie time! (Expert left, IRL right)

Page 32

Demo-1 Nice

Page 33

Demo-2 Right Lane Nasty

Page 34

Car Driving Results