Apprenticeship Learning via Inverse Reinforcement Learning
TRANSCRIPT
![Page 1: Apprenticeship Learning via Inverse Reinforcement Learninghomes.cs.washington.edu/~todorov/courses/amath579/Inverse.pdf · Scientific inquiry Model animal and human behavior E.g.,](https://reader033.vdocuments.site/reader033/viewer/2022042017/5e7594d4a0037e51660dbd44/html5/thumbnails/1.jpg)
Inverse Reinforcement Learning
Krishnamurthy Dvijotham
UW
(Most slides borrowed/adapted from Pieter Abbeel)
Reinforcement Learning
Almost the same as Optimal Control
“Reinforcement” term coined by psychologists studying animal learning
Focus on discrete state spaces, highly stochastic environments, learning to control without knowing system model in general
Work with rewards instead of costs, usually discounted
Inverse Reinforcement Learning
Dynamics Model + Expert Trajectories → IOC Algorithm → Reward Function R
R explains expert trajectories
Scientific inquiry
Model animal and human behavior
E.g., bee foraging, songbird vocalization. [See intro of Ng and Russell, 2000 for a brief overview.]
Apprenticeship learning/Imitation learning through inverse RL
Presupposition: reward function provides the most succinct and transferable definition of the task
Has enabled advancing the state of the art in various robotic domains
Modeling of other agents, both adversarial and cooperative
Motivation for inverse RL
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Simulated highway driving
Abbeel and Ng, ICML 2004,
Syed and Schapire, NIPS 2007
Aerial imagery based navigation
Ratliff, Bagnell and Zinkevich, ICML 2006
Parking lot navigation
Abbeel, Dolgov, Ng and Thrun, IROS 2008
Urban navigation
Ziebart, Maas, Bagnell and Dey, AAAI 2008
Quadruped locomotion
Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007
Kolter, Abbeel and Ng, NIPS 2008
Examples
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Input:
State space, action space
Transition model P_sa(s_{t+1} | s_t, a_t)
No reward function
Teacher’s demonstration: s0, a0, s1, a1, s2, a2, …
(= trace of the teacher’s policy π*)
Inverse RL:
Can we recover R?
Apprenticeship learning via inverse RL:
Can we then use this R to find a good policy?
Behavioral cloning
Can we directly learn the teacher’s policy using supervised learning?
Problem setup
Formulate as standard machine learning problem
Fix a policy class
E.g., support vector machine, neural network, decision tree, deep belief net, …
Estimate a policy (=mapping from states to actions) from the training examples (s0, a0), (s1, a1), (s2, a2), …
Two of the most notable success stories:
Pomerleau, NIPS 1989: ALVINN
Sammut et al., ICML 1992: Learning to fly (flight sim)
Behavioral cloning
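The supervised-learning recipe above can be sketched in a few lines. A minimal illustration; the toy gridworld expert and the 1-nearest-neighbor policy class are invented for the example (the slide’s SVMs or neural networks would slot in the same way):

```python
import math

# Toy expert: in a gridworld, move toward the goal at (5, 5).
def expert_action(state):
    x, y = state
    return "right" if (5 - x) >= (5 - y) else "up"

# Teacher's demonstrations: (state, action) pairs, i.e. (s_t, a_t) traces.
demos = [((x, y), expert_action((x, y))) for x in range(5) for y in range(5)]

# Policy class: 1-nearest-neighbor over the demonstrated states.
def cloned_policy(state):
    nearest_state, nearest_action = min(demos,
                                        key=lambda d: math.dist(state, d[0]))
    return nearest_action

print(cloned_policy((1, 3)))      # "right", agreeing with the expert
print(cloned_policy((2.4, 4.2)))  # generalizes to unseen nearby states
```

Once the clone drifts to states far from any demonstration its predictions degrade, which is part of the motivation for recovering R instead.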
Which has the most succinct description: π* vs. R*?
Especially in planning oriented tasks, the reward function is often much more succinct than the optimal policy.
Inverse RL vs. behavioral cloning
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
1964, Kalman posed the inverse optimal control problem and solved it in the 1D input case
1994, Boyd+al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting
2000, Ng and Russell: first MDP formulation, reward function ambiguity pointed out and a few solutions suggested
2004, Abbeel and Ng: inverse RL for apprenticeship learning---reward feature matching
2006, Ratliff+al: max margin formulation
Inverse RL history
2007, Ratliff+al: max margin with boosting---enables large vocabulary of reward features
2007, Ramachandran and Amir, and Neu and Szepesvari: reward function as characterization of policy class
2008, Kolter, Abbeel and Ng: hierarchical max-margin
2008, Syed and Schapire: feature matching + game theoretic formulation
2008, Ziebart+al: feature matching + max entropy
2010, Dvijotham, Todorov: Inverse RL with LMDPs
Active inverse RL? Inverse RL w.r.t. minmax control, partial observability, learning stage (rather than observing optimal policy), … ?
Inverse RL history
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
IRL with LMDP: OptV
Algorithm (OptV):
1. Input: transitions {(x_n, y_n)} sampled from u*, dynamics P(x'|x)
2. Estimate v by maximizing the log-likelihood
3. Compute q from v using the LMDP Bellman equation
No need to solve the optimal control problem repeatedly.
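In an LMDP the optimal policy has the closed form u*(y|x) ∝ P(y|x)·exp(−v(y)), which makes step 2 a concave maximum-likelihood problem in v. A toy numeric sketch of the estimation step; the 4-state LMDP, sample size, and plain gradient ascent are illustrative choices, not the paper’s setup:

```python
import math
import random

random.seed(0)
S = 4
# Passive dynamics P(y|x): a random stochastic matrix.
P = []
for x in range(S):
    row = [random.random() + 0.1 for _ in range(S)]
    tot = sum(row)
    P.append([p / tot for p in row])
v_true = [0.0, 1.0, -0.5, 0.7]  # "true" value function to recover

# Optimal LMDP control: u*(y|x) proportional to P(y|x) * exp(-v(y)).
U = []
for x in range(S):
    row = [P[x][y] * math.exp(-v_true[y]) for y in range(S)]
    tot = sum(row)
    U.append([u / tot for u in row])

# Step 1: transitions {(x_n, y_n)} sampled from u*.
N = 20000
xs = [random.randrange(S) for _ in range(N)]
ys = [random.choices(range(S), weights=U[x])[0] for x in xs]
x_counts = [xs.count(k) for k in range(S)]
y_counts = [ys.count(k) for k in range(S)]

# Step 2: gradient ascent on the log-likelihood
#   L(v) = sum_n [ -v(y_n) - log sum_y P(y|x_n) exp(-v(y)) ]  (+ const).
v = [0.0] * S
for _ in range(2000):
    e = [math.exp(-vk) for vk in v]
    norm = [sum(P[x][y] * e[y] for y in range(S)) for x in range(S)]
    grad = [-y_counts[k] + sum(x_counts[x] * P[x][k] * e[k] / norm[x]
                               for x in range(S)) for k in range(S)]
    v = [v[k] + 0.5 * grad[k] / N for k in range(S)]

# v is identifiable only up to an additive constant, so compare centered.
c, c_true = sum(v) / S, sum(v_true) / S
err = max(abs((v[k] - c) - (v_true[k] - c_true)) for k in range(S))
print(f"max centered error in recovered v: {err:.3f}")
```

Step 3 then reads q off from v; no optimal-control solve is needed inside the loop, which is the point of the slide.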
Continuous State Spaces
How to represent v in a continuous state space?
Discretize: blows up exponentially with dimension.
[Figure: typical 2D sample with local Gaussian bases]
Intuition
RBFs
Representation of v in continuous state spaces?
1. Lots of approximating power in highly visited regions
2. Low approximating power (smoother approximation) in low-density regions
Basis Adaptation
Complete log-likelihood:
Algorithm (OptVA):
1. Initialize θ “cleverly”
2. Until convergence:
   a. L-BFGS to compute L*(θ)
   b. Conjugate gradient ascent on L*(θ)
OptQ
1. Observe a set of expert trajectories
2. Write down the log-likelihood of the set of trajectories
3. Express z in terms of q
4. The resulting expression is convex in q: optimize!
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
IRL with LMDPs
IRL with Traditional MDPs
Examples
Lecture outline
IRL in Traditional MDPs
GridWorld:
• Controls:
• First-exit: minimize cost until you hit the goal
IRL is ill-posed (Ng ’00)
IRL in Traditional MDPs
Let P_Π be the state transition matrix under policy Π. Then any reward function R satisfying
(P_Π − P_a)(I − γ P_Π)⁻¹ R ⪰ 0  for every action a
makes Π optimal (Ng and Russell, 2000).
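The finite-MDP optimality condition from Ng and Russell (2000), (P_Π − P_a)(I − γP_Π)⁻¹R ⪰ 0, can be checked numerically. A minimal sketch on a randomly generated MDP; the MDP, the value-iteration solver, and all constants are invented for the example. Since (I − γP_π*)⁻¹R = V* for the optimal π*, the condition reduces to (P_π* − P_a)V* ≥ 0 componentwise:

```python
import random

random.seed(1)
S, A, gamma = 4, 3, 0.9

def stochastic_row():
    row = [random.random() + 0.05 for _ in range(S)]
    tot = sum(row)
    return [p / tot for p in row]

P = [[stochastic_row() for _ in range(S)] for _ in range(A)]  # P[a][s][s']
R = [random.uniform(-1.0, 1.0) for _ in range(S)]

def exp_next(a, s, V):
    return sum(P[a][s][t] * V[t] for t in range(S))

# Value iteration for the optimal value function under R.
V = [0.0] * S
for _ in range(1000):
    V = [R[s] + gamma * max(exp_next(a, s, V) for a in range(A))
         for s in range(S)]
pi = [max(range(A), key=lambda a: exp_next(a, s, V)) for s in range(S)]

# Ng-Russell condition: (P_pi - P_a)(I - gamma P_pi)^{-1} R >= 0 for all a.
# Here (I - gamma P_pi)^{-1} R equals V (pi is optimal), so check
# (P_pi - P_a) V >= 0 componentwise.
min_gap = min(exp_next(pi[s], s, V) - exp_next(a, s, V)
              for s in range(S) for a in range(A))
print(f"smallest slack in the condition: {min_gap:.6f}")  # >= 0
# Note R = 0 satisfies the condition for *every* policy -- the
# reward-function ambiguity discussed on the next slide.
```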
Find a reward function R* which explains the expert behaviour.
Find R* such that
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
In fact a convex feasibility problem, but many challenges:
R = 0 is a solution; more generally: reward function ambiguity
We typically only observe expert traces rather than the entire expert policy π* --- how to compute the LHS?
Assumes the expert is indeed optimal --- otherwise infeasible
Computationally: assumes we can enumerate all policies
IRL in MDPs: Basic principle
Feature based reward function
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n. Substituting into
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
gives us:
E[∑_{t=0}^∞ γ^t R(s_t) | π] = E[∑_{t=0}^∞ γ^t w^T φ(s_t) | π] = w^T E[∑_{t=0}^∞ γ^t φ(s_t) | π] = w^T μ(π)
where μ(π), the expected cumulative discounted sum of feature values, is called the “feature expectations”.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
Feature expectations can be readily estimated from sample trajectories.
The number of expert demonstrations required scales with the number of features in the reward function.
The number of expert demonstrations required does not depend on:
the complexity of the expert’s optimal policy π*
the size of the state space directly
Feature based reward function
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
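Estimating feature expectations μ(π) = E[∑_t γ^t φ(s_t)] from sampled trajectories is straightforward Monte Carlo. A small sketch for a fixed policy on a 3-state chain, compared against the exact value; the chain, the features, and the sample sizes are all made up for illustration:

```python
import random

random.seed(2)
gamma, S = 0.9, 3
# Markov chain induced by a fixed policy pi, and per-state features phi(s).
P_pi = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.3, 0.3, 0.4]]
phi = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Exact feature expectations: mu = sum_t gamma^t d_t^T Phi, with d_0 = e_0.
d = [1.0, 0.0, 0.0]
mu_exact = [0.0, 0.0]
for t in range(400):
    for s in range(S):
        for k in range(2):
            mu_exact[k] += gamma**t * d[s] * phi[s][k]
    d = [sum(d[s] * P_pi[s][sp] for s in range(S)) for sp in range(S)]

# Monte Carlo estimate from sampled trajectories (truncated at H = 100,
# where gamma^H is negligible).
n_traj, H = 4000, 100
mu_hat = [0.0, 0.0]
for _ in range(n_traj):
    s = 0
    for t in range(H):
        for k in range(2):
            mu_hat[k] += gamma**t * phi[s][k] / n_traj
        s = random.choices(range(S), weights=P_pi[s])[0]

err = max(abs(a - b) for a, b in zip(mu_hat, mu_exact))
print(f"mu_exact={mu_exact}, mu_hat={mu_hat}, max error={err:.3f}")
```

Note that nothing here depends on enumerating policies or on the state-space size beyond simulating the chain, in line with the slide’s claims.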
Challenges:
Assumes we know the entire expert policy π* (i.e., assumes we can estimate expert feature expectations)
R = 0 is a solution (now: w = 0); more generally: reward function ambiguity
Assumes the expert is indeed optimal --- became even more of an issue with the more limited reward function expressiveness!
Computationally: assumes we can enumerate all policies
Recap of challenges
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n. Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
Standard max margin:
min_w ||w||₂²
s.t. w^T μ(π*) ≥ w^T μ(π) + 1  ∀π
“Structured prediction” max margin:
min_w ||w||₂²
s.t. w^T μ(π*) ≥ w^T μ(π) + m(π*, π)  ∀π
Justification: the margin should be larger for policies that are very different from π*.
Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree
Ambiguity
Structured prediction max margin with slack variables:
min_w ||w||₂² + Cξ
s.t. w^T μ(π*) ≥ w^T μ(π) + m(π*, π) − ξ  ∀π
Can be generalized to multiple MDPs (could also be the same MDP with different initial states):
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, π^(i)
Expert suboptimality
Resolved: access to π*, ambiguity, expert suboptimality
One challenge remains: very large number of constraints
Ratliff+al use subgradient methods.
In this lecture: constraint generation
Complete max-margin formulation
[Ratliff, Zinkevich and Bagnell, 2006]
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, π^(i)
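On a toy instance, this objective can be minimized with the subgradient approach the slide attributes to Ratliff et al.: rewrite the constraints as a hinge term, and at each step find the currently most violated (loss-augmented) candidate and step away from its feature expectations. The candidate policies, their feature expectations, and the losses below are all made up for illustration:

```python
# Expert feature expectations mu(pi*) and a hand-made candidate set of
# (feature expectations mu(pi), loss m(pi*, pi)); the expert has loss 0.
mu_star = (1.0, 0.8)
cands = [((1.0, 0.8), 0.0),
         ((0.9, 0.1), 1.0),
         ((0.2, 0.7), 2.0),
         ((0.5, 0.5), 1.5)]
lam = 0.01  # regularization weight

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def objective(w):
    # lam/2 ||w||^2 + max_pi [ w.mu(pi) + m(pi*, pi) ] - w.mu(pi*)
    worst = max(dot(w, mu) + m for mu, m in cands)
    return 0.5 * lam * dot(w, w) + worst - dot(w, mu_star)

w, best_w, best_f = (0.0, 0.0), (0.0, 0.0), objective((0.0, 0.0))
lr = 0.05
for _ in range(20000):
    # Loss-augmented "planning" step: the most violated constraint.
    mu_j, m_j = max(cands, key=lambda c: dot(w, c[0]) + c[1])
    g = (lam * w[0] + mu_j[0] - mu_star[0],
         lam * w[1] + mu_j[1] - mu_star[1])
    w = (w[0] - lr * g[0], w[1] - lr * g[1])
    if objective(w) < best_f:
        best_f, best_w = objective(w), w

# Near the optimum, every margin constraint is (near-)satisfied:
for mu, m in cands[1:]:
    print(f"margin {dot(best_w, mu_star) - dot(best_w, mu):.2f} vs loss {m}")
```

In a real max-margin planning problem the inner `max` over candidates is itself an optimal-control solve, which is exactly what the constraint-generation scheme on the next slide exploits.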
Initialize Π^(i) = {} for all i and then iterate:
1. Solve
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, ∀π^(i) ∈ Π^(i)
2. For the current value of w, find the most violated constraint for all i by solving:
max_{π^(i)} w^T μ(π^(i)) + m(π^(i)*, π^(i))
= find the optimal policy for the current estimate of the reward function (+ loss augmentation m)
3. For all i, add π^(i) to Π^(i).
4. If no constraint violations were found, we are done.
Constraint generation
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D.
E[∑_{t=0}^∞ γ^t R(s_t) | π] = w^T μ(π)
[Figure: policies as points in the (μ₁, μ₂) plane with μ(π*) marked; the max-margin direction w_mm and the structured max-margin direction w_smm separate μ(π*) from the other policies.]
Visualization in feature expectation space
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D.
constraint generation: max_π E[∑_{t=0}^∞ γ^t R_w(s_t) | π] = max_π w^T μ(π)
[Figure: successive iterates μ(π^(0)), μ(π^(1)), μ(π^(2)) in the (μ₁, μ₂) plane closing in on μ(π*), with intermediate weight vectors w^(2) and w^(3) = w_mm.]
Constraint generation
Inverse RL starting point: find a reward function such that the expert outperforms other policies
Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match:
Feature matching
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
||μ(π) − μ(π*)||_∞ ≤ ε
implies that for all w with ||w||₁ ≤ 1:
E[∑_{t=0}^∞ γ^t R_{w*}(s_t) | π] = w*^T μ(π) ≥ w*^T μ(π*) − ε = E[∑_{t=0}^∞ γ^t R_{w*}(s_t) | π*] − ε
Hölder’s inequality: ∀ p, q ≥ 1, if 1/p + 1/q = 1, then x^T y ≤ ||x||_p ||y||_q
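The Hölder step can be sanity-checked numerically: draw random w with ||w||₁ = 1 and random μ’s differing from μ* by at most ε per coordinate, and confirm the value gap w·μ* − w·μ never exceeds ε (all constants below are purely illustrative):

```python
import random

random.seed(3)
eps, n = 0.1, 5
worst_gap = 0.0
for _ in range(1000):
    w = [random.uniform(-1, 1) for _ in range(n)]
    norm1 = sum(abs(x) for x in w)
    w = [x / norm1 for x in w]                 # ||w||_1 = 1
    mu_star = [random.uniform(0, 10) for _ in range(n)]
    # ||mu - mu*||_inf <= eps: feature expectations nearly match.
    mu = [m + random.uniform(-eps, eps) for m in mu_star]
    gap = sum(wi * (ms - m) for wi, ms, m in zip(w, mu_star, mu))
    worst_gap = max(worst_gap, gap)

print(f"worst w.mu* - w.mu over 1000 draws: {worst_gap:.4f}")  # <= eps = 0.1
```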
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
History of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
GridWorld CPU times
Results
• Colored plots represent functions f(x, y)
• Red = high, blue = low
• Continuous state problems, 2D state space
• x-axis: position, y-axis: velocity
Results on Continuous Problems
True Cost
Results on Continuous Problems
True Cost
Comparison to Other Algorithms
[Figure: optimal policy alongside the OptV estimate (2 min), the Ng ’00 estimate, and the Abbeel ’04 estimate]
Controls were discretized for the other algorithms to make them run tractably.
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Case studies
Inverse Optimal Control with LMDPs
Lecture outline
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Inverse RL with LMDPs vs Inverse RL with Traditional MDPs
Summary
Abbeel and Ng. Apprenticeship learning via inverse reinforcement learning. (ICML 2004)
Boyd, El Ghaoui, Feron and Balakrishnan. Linear matrix inequalities in system and control theory. (SIAM 1994)
Kolter, Abbeel and Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. (NIPS 2008)
Neu and Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. (UAI 2007)
Ng and Russell. Algorithms for inverse reinforcement learning. (ICML 2000)
Ratliff, Bagnell and Zinkevich. Maximum margin planning. (ICML 2006)
Ratliff, Bradley, Bagnell and Chestnutt. Boosting structured prediction for imitation learning. (NIPS 2007)
Ramachandran and Amir. Bayesian inverse reinforcement learning. (IJCAI 2007)
Syed and Schapire. A game-theoretic approach to apprenticeship learning. (NIPS 2008)
Ziebart, Maas, Bagnell and Dey. Maximum entropy inverse reinforcement learning. (AAAI 2008)
Dvijotham and Todorov. Inverse optimal control with linearly-solvable MDPs. (ICML 2010)
More materials / References