Apprenticeship Learning via Inverse Reinforcement Learning
TRANSCRIPT
![Page 1: Apprenticeship Learning via Inverse Reinforcement Learninghomes.cs.washington.edu/~todorov/courses/amath579/Inverse.pdf · Scientific inquiry Model animal and human behavior E.g.,](https://reader033.vdocuments.site/reader033/viewer/2022042017/5e7594d4a0037e51660dbd44/html5/thumbnails/1.jpg)
Inverse Reinforcement Learning
Krishnamurthy Dvijotham
UW
(Most slides borrowed/adapted from Pieter Abbeel)
Reinforcement Learning
Almost the same as Optimal Control
“Reinforcement” term coined by psychologists studying animal learning
Focus on discrete state spaces, highly stochastic environments, learning to control without knowing system model in general
Work with rewards instead of costs, usually discounted
Inverse Reinforcement Learning
Dynamics Model + Expert Trajectories → IOC Algorithm → Reward Function R
R explains expert trajectories
Scientific inquiry
Model animal and human behavior
E.g., bee foraging, songbird vocalization. [See intro of Ng and Russell, 2000 for a brief overview.]
Apprenticeship learning/Imitation learning through inverse RL
Presupposition: reward function provides the most succinct and transferable definition of the task
Has enabled advancing the state of the art in various robotic domains
Modeling of other agents, both adversarial and cooperative
Motivation for inverse RL
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Simulated highway driving
Abbeel and Ng, ICML 2004,
Syed and Schapire, NIPS 2007
Aerial imagery based navigation
Ratliff, Bagnell and Zinkevich, ICML 2006
Parking lot navigation
Abbeel, Dolgov, Ng and Thrun, IROS 2008
Urban navigation
Ziebart, Maas, Bagnell and Dey, AAAI 2008
Quadruped locomotion
Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007
Kolter, Abbeel and Ng, NIPS 2008
Examples
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Input:
State space, action space
Transition model P_sa(s_{t+1} | s_t, a_t)
No reward function
Teacher’s demonstration: s0, a0, s1, a1, s2, a2, …
(= trace of the teacher’s policy π*)
Inverse RL:
Can we recover R?
Apprenticeship learning via inverse RL:
Can we then use this R to find a good policy?
Behavioral cloning
Can we directly learn the teacher’s policy using supervised learning?
Problem setup
Formulate as standard machine learning problem
Fix a policy class
E.g., support vector machine, neural network, decision tree, deep belief net, …
Estimate a policy (=mapping from states to actions) from the training examples (s0, a0), (s1, a1), (s2, a2), …
Two of the most notable success stories:
Pomerleau, NIPS 1989: ALVINN
Sammut et al., ICML 1992: Learning to fly (flight sim)
Behavioral cloning
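The supervised-learning recipe above can be sketched in a few lines. A minimal illustration; the toy gridworld expert and the 1-nearest-neighbor policy class are invented for the example (the slide’s SVMs or neural networks would slot in the same way):

```python
import math

# Toy expert: in a gridworld, move toward the goal at (5, 5).
def expert_action(state):
    x, y = state
    return "right" if (5 - x) >= (5 - y) else "up"

# Teacher's demonstrations: (state, action) pairs, i.e. (s_t, a_t) traces.
demos = [((x, y), expert_action((x, y))) for x in range(5) for y in range(5)]

# Policy class: 1-nearest-neighbor over the demonstrated states.
def cloned_policy(state):
    nearest_state, nearest_action = min(demos,
                                        key=lambda d: math.dist(state, d[0]))
    return nearest_action

print(cloned_policy((1, 3)))      # "right", agreeing with the expert
print(cloned_policy((2.4, 4.2)))  # generalizes to unseen nearby states
```

Once the clone drifts to states far from any demonstration its predictions degrade, which is part of the motivation for recovering R instead.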
Which has the most succinct description: π* vs. R*?
Especially in planning oriented tasks, the reward function is often much more succinct than the optimal policy.
Inverse RL vs. behavioral cloning
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
1964, Kalman posed the inverse optimal control problem and solved it in the 1D input case
1994, Boyd+al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting
2000, Ng and Russell: first MDP formulation, reward function ambiguity pointed out and a few solutions suggested
2004, Abbeel and Ng: inverse RL for apprenticeship learning---reward feature matching
2006, Ratliff+al: max margin formulation
Inverse RL history
2007, Ratliff+al: max margin with boosting---enables large vocabulary of reward features
2007, Ramachandran and Amir, and Neu and Szepesvari: reward function as characterization of policy class
2008, Kolter, Abbeel and Ng: hierarchical max-margin
2008, Syed and Schapire: feature matching + game theoretic formulation
2008, Ziebart+al: feature matching + max entropy
2010, Dvijotham, Todorov: Inverse RL with LMDPs
Active inverse RL? Inverse RL w.r.t. minmax control, partial observability, learning stage (rather than observing optimal policy), … ?
Inverse RL history
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
IRL with LMDP: OptV
Algorithm (OptV):
1. Input: transitions {(x_n, y_n)} sampled from u*, dynamics P(x'|x)
2. Estimate v by maximizing the log-likelihood
3. Compute q from v using the LMDP Bellman equation
No need to solve the optimal control problem repeatedly.
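In an LMDP the optimal policy has the closed form u*(y|x) ∝ P(y|x)·exp(−v(y)), which makes step 2 a concave maximum-likelihood problem in v. A toy numeric sketch of the estimation step; the 4-state LMDP, sample size, and plain gradient ascent are illustrative choices, not the paper’s setup:

```python
import math
import random

random.seed(0)
S = 4
# Passive dynamics P(y|x): a random stochastic matrix.
P = []
for x in range(S):
    row = [random.random() + 0.1 for _ in range(S)]
    tot = sum(row)
    P.append([p / tot for p in row])
v_true = [0.0, 1.0, -0.5, 0.7]  # "true" value function to recover

# Optimal LMDP control: u*(y|x) proportional to P(y|x) * exp(-v(y)).
U = []
for x in range(S):
    row = [P[x][y] * math.exp(-v_true[y]) for y in range(S)]
    tot = sum(row)
    U.append([u / tot for u in row])

# Step 1: transitions {(x_n, y_n)} sampled from u*.
N = 20000
xs = [random.randrange(S) for _ in range(N)]
ys = [random.choices(range(S), weights=U[x])[0] for x in xs]
x_counts = [xs.count(k) for k in range(S)]
y_counts = [ys.count(k) for k in range(S)]

# Step 2: gradient ascent on the log-likelihood
#   L(v) = sum_n [ -v(y_n) - log sum_y P(y|x_n) exp(-v(y)) ]  (+ const).
v = [0.0] * S
for _ in range(2000):
    e = [math.exp(-vk) for vk in v]
    norm = [sum(P[x][y] * e[y] for y in range(S)) for x in range(S)]
    grad = [-y_counts[k] + sum(x_counts[x] * P[x][k] * e[k] / norm[x]
                               for x in range(S)) for k in range(S)]
    v = [v[k] + 0.5 * grad[k] / N for k in range(S)]

# v is identifiable only up to an additive constant, so compare centered.
c, c_true = sum(v) / S, sum(v_true) / S
err = max(abs((v[k] - c) - (v_true[k] - c_true)) for k in range(S))
print(f"max centered error in recovered v: {err:.3f}")
```

Step 3 then reads q off from v; no optimal-control solve is needed inside the loop, which is the point of the slide.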
Continuous State Spaces
How to represent v in a continuous state space?
Discretize: blows up exponentially with dimension.
[Figure: typical 2D sample with local Gaussian bases]
Intuition
RBFs
Representation of v in continuous state spaces?
1. Lots of approximating power in highly visited regions
2. Low approximating power (smoother approximation) in low-density regions
Basis Adaptation
Complete log-likelihood:
Algorithm (OptVA):
1. Initialize θ “cleverly”
2. Until convergence:
   a. L-BFGS to compute L*(θ)
   b. Conjugate gradient ascent on L*(θ)
OptQ
1. Observe a set of expert trajectories
2. Write down the log-likelihood of the set of trajectories
3. Express z in terms of q
4. The resulting expression is convex in q: optimize!
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
IRL with LMDPs
IRL with Traditional MDPs
Examples
Lecture outline
IRL in Traditional MDPs
GridWorld:
• Controls:
• First-exit: minimize cost until you hit the goal
IRL is ill-posed (Ng ’00)
IRL in Traditional MDPs
Let P_Π be the state transition matrix under policy Π. Then any reward function R satisfying
(P_Π − P_a)(I − γ P_Π)⁻¹ R ⪰ 0  for every action a
makes Π optimal (Ng and Russell, 2000).
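The finite-MDP optimality condition from Ng and Russell (2000), (P_Π − P_a)(I − γP_Π)⁻¹R ⪰ 0, can be checked numerically. A minimal sketch on a randomly generated MDP; the MDP, the value-iteration solver, and all constants are invented for the example. Since (I − γP_π*)⁻¹R = V* for the optimal π*, the condition reduces to (P_π* − P_a)V* ≥ 0 componentwise:

```python
import random

random.seed(1)
S, A, gamma = 4, 3, 0.9

def stochastic_row():
    row = [random.random() + 0.05 for _ in range(S)]
    tot = sum(row)
    return [p / tot for p in row]

P = [[stochastic_row() for _ in range(S)] for _ in range(A)]  # P[a][s][s']
R = [random.uniform(-1.0, 1.0) for _ in range(S)]

def exp_next(a, s, V):
    return sum(P[a][s][t] * V[t] for t in range(S))

# Value iteration for the optimal value function under R.
V = [0.0] * S
for _ in range(1000):
    V = [R[s] + gamma * max(exp_next(a, s, V) for a in range(A))
         for s in range(S)]
pi = [max(range(A), key=lambda a: exp_next(a, s, V)) for s in range(S)]

# Ng-Russell condition: (P_pi - P_a)(I - gamma P_pi)^{-1} R >= 0 for all a.
# Here (I - gamma P_pi)^{-1} R equals V (pi is optimal), so check
# (P_pi - P_a) V >= 0 componentwise.
min_gap = min(exp_next(pi[s], s, V) - exp_next(a, s, V)
              for s in range(S) for a in range(A))
print(f"smallest slack in the condition: {min_gap:.6f}")  # >= 0
# Note R = 0 satisfies the condition for *every* policy -- the
# reward-function ambiguity discussed on the next slide.
```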
Find a reward function R* which explains the expert behaviour.
Find R* such that
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
In fact a convex feasibility problem, but many challenges:
R = 0 is a solution; more generally: reward function ambiguity
We typically only observe expert traces rather than the entire expert policy π* --- how to compute the LHS?
Assumes the expert is indeed optimal --- otherwise infeasible
Computationally: assumes we can enumerate all policies
IRL in MDPs: Basic principle
Feature based reward function
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n. Substituting into
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
gives us:
E[∑_{t=0}^∞ γ^t R(s_t) | π] = E[∑_{t=0}^∞ γ^t w^T φ(s_t) | π] = w^T E[∑_{t=0}^∞ γ^t φ(s_t) | π] = w^T μ(π)
where μ(π), the expected cumulative discounted sum of feature values, is called the “feature expectations”.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
Feature expectations can be readily estimated from sample trajectories.
The number of expert demonstrations required scales with the number of features in the reward function.
The number of expert demonstrations required does not depend on:
the complexity of the expert’s optimal policy π*
the size of the state space directly
Feature based reward function
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
E[∑_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[∑_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
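Estimating feature expectations μ(π) = E[∑_t γ^t φ(s_t)] from sampled trajectories is straightforward Monte Carlo. A small sketch for a fixed policy on a 3-state chain, compared against the exact value; the chain, the features, and the sample sizes are all made up for illustration:

```python
import random

random.seed(2)
gamma, S = 0.9, 3
# Markov chain induced by a fixed policy pi, and per-state features phi(s).
P_pi = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.3, 0.3, 0.4]]
phi = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Exact feature expectations: mu = sum_t gamma^t d_t^T Phi, with d_0 = e_0.
d = [1.0, 0.0, 0.0]
mu_exact = [0.0, 0.0]
for t in range(400):
    for s in range(S):
        for k in range(2):
            mu_exact[k] += gamma**t * d[s] * phi[s][k]
    d = [sum(d[s] * P_pi[s][sp] for s in range(S)) for sp in range(S)]

# Monte Carlo estimate from sampled trajectories (truncated at H = 100,
# where gamma^H is negligible).
n_traj, H = 4000, 100
mu_hat = [0.0, 0.0]
for _ in range(n_traj):
    s = 0
    for t in range(H):
        for k in range(2):
            mu_hat[k] += gamma**t * phi[s][k] / n_traj
        s = random.choices(range(S), weights=P_pi[s])[0]

err = max(abs(a - b) for a, b in zip(mu_hat, mu_exact))
print(f"mu_exact={mu_exact}, mu_hat={mu_hat}, max error={err:.3f}")
```

Note that nothing here depends on enumerating policies or on the state-space size beyond simulating the chain, in line with the slide’s claims.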
Challenges:
Assumes we know the entire expert policy π* (i.e., assumes we can estimate expert feature expectations)
R = 0 is a solution (now: w = 0); more generally: reward function ambiguity
Assumes the expert is indeed optimal --- became even more of an issue with the more limited reward function expressiveness!
Computationally: assumes we can enumerate all policies
Recap of challenges
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n. Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
Standard max margin:
min_w ||w||₂²
s.t. w^T μ(π*) ≥ w^T μ(π) + 1  ∀π
“Structured prediction” max margin:
min_w ||w||₂²
s.t. w^T μ(π*) ≥ w^T μ(π) + m(π*, π)  ∀π
Justification: the margin should be larger for policies that are very different from π*.
Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree
Ambiguity
Structured prediction max margin with slack variables:
min_w ||w||₂² + Cξ
s.t. w^T μ(π*) ≥ w^T μ(π) + m(π*, π) − ξ  ∀π
Can be generalized to multiple MDPs (could also be the same MDP with different initial states):
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, π^(i)
Expert suboptimality
Resolved: access to π*, ambiguity, expert suboptimality
One challenge remains: very large number of constraints
Ratliff+al use subgradient methods.
In this lecture: constraint generation
Complete max-margin formulation
[Ratliff, Zinkevich and Bagnell, 2006]
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, π^(i)
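On a toy instance, this objective can be minimized with the subgradient approach the slide attributes to Ratliff et al.: rewrite the constraints as a hinge term, and at each step find the currently most violated (loss-augmented) candidate and step away from its feature expectations. The candidate policies, their feature expectations, and the losses below are all made up for illustration:

```python
# Expert feature expectations mu(pi*) and a hand-made candidate set of
# (feature expectations mu(pi), loss m(pi*, pi)); the expert has loss 0.
mu_star = (1.0, 0.8)
cands = [((1.0, 0.8), 0.0),
         ((0.9, 0.1), 1.0),
         ((0.2, 0.7), 2.0),
         ((0.5, 0.5), 1.5)]
lam = 0.01  # regularization weight

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def objective(w):
    # lam/2 ||w||^2 + max_pi [ w.mu(pi) + m(pi*, pi) ] - w.mu(pi*)
    worst = max(dot(w, mu) + m for mu, m in cands)
    return 0.5 * lam * dot(w, w) + worst - dot(w, mu_star)

w, best_w, best_f = (0.0, 0.0), (0.0, 0.0), objective((0.0, 0.0))
lr = 0.05
for _ in range(20000):
    # Loss-augmented "planning" step: the most violated constraint.
    mu_j, m_j = max(cands, key=lambda c: dot(w, c[0]) + c[1])
    g = (lam * w[0] + mu_j[0] - mu_star[0],
         lam * w[1] + mu_j[1] - mu_star[1])
    w = (w[0] - lr * g[0], w[1] - lr * g[1])
    if objective(w) < best_f:
        best_f, best_w = objective(w), w

# Near the optimum, every margin constraint is (near-)satisfied:
for mu, m in cands[1:]:
    print(f"margin {dot(best_w, mu_star) - dot(best_w, mu):.2f} vs loss {m}")
```

In a real max-margin planning problem the inner `max` over candidates is itself an optimal-control solve, which is exactly what the constraint-generation scheme on the next slide exploits.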
Initialize Π^(i) = {} for all i and then iterate:
1. Solve
min_w ||w||₂² + C ∑_i ξ^(i)
s.t. w^T μ(π^(i)*) ≥ w^T μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, ∀π^(i) ∈ Π^(i)
2. For the current value of w, find the most violated constraint for all i by solving:
max_{π^(i)} w^T μ(π^(i)) + m(π^(i)*, π^(i))
= find the optimal policy for the current estimate of the reward function (+ loss augmentation m)
3. For all i, add π^(i) to Π^(i).
4. If no constraint violations were found, we are done.
Constraint generation
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D.
E[∑_{t=0}^∞ γ^t R(s_t) | π] = w^T μ(π)
[Figure: policies as points in the (μ₁, μ₂) plane with μ(π*) marked; the max-margin direction w_mm and the structured max-margin direction w_smm separate μ(π*) from the other policies.]
Visualization in feature expectation space
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D.
constraint generation: max_π E[∑_{t=0}^∞ γ^t R_w(s_t) | π] = max_π w^T μ(π)
[Figure: successive iterates μ(π^(0)), μ(π^(1)), μ(π^(2)) in the (μ₁, μ₂) plane closing in on μ(π*), with intermediate weight vectors w^(2) and w^(3) = w_mm.]
Constraint generation
Inverse RL starting point: find a reward function such that the expert outperforms other policies
Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match:
Feature matching
Let R(s) = w^T φ(s), where w ∈ ℝ^n and φ : S → ℝ^n.
Find w* such that w*^T μ(π*) ≥ w*^T μ(π)  ∀π
||μ(π) − μ(π*)||_∞ ≤ ε
implies that for all w with ||w||₁ ≤ 1:
E[∑_{t=0}^∞ γ^t R_{w*}(s_t) | π] = w*^T μ(π) ≥ w*^T μ(π*) − ε = E[∑_{t=0}^∞ γ^t R_{w*}(s_t) | π*] − ε
Hölder’s inequality: ∀ p, q ≥ 1, if 1/p + 1/q = 1, then x^T y ≤ ||x||_p ||y||_q
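The Hölder step can be sanity-checked numerically: draw random w with ||w||₁ = 1 and random μ’s differing from μ* by at most ε per coordinate, and confirm the value gap w·μ* − w·μ never exceeds ε (all constants below are purely illustrative):

```python
import random

random.seed(3)
eps, n = 0.1, 5
worst_gap = 0.0
for _ in range(1000):
    w = [random.uniform(-1, 1) for _ in range(n)]
    norm1 = sum(abs(x) for x in w)
    w = [x / norm1 for x in w]                 # ||w||_1 = 1
    mu_star = [random.uniform(0, 10) for _ in range(n)]
    # ||mu - mu*||_inf <= eps: feature expectations nearly match.
    mu = [m + random.uniform(-eps, eps) for m in mu_star]
    gap = sum(wi * (ms - m) for wi, ms, m in zip(w, mu_star, mu))
    worst_gap = max(worst_gap, gap)

print(f"worst w.mu* - w.mu over 1000 draws: {worst_gap:.4f}")  # <= eps = 0.1
```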
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
History of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
GridWorld CPU times
Results
• Colored plots represent functions f(x, y)
• Red = high, blue = low
• Continuous state problems, 2D state space
• x-axis: position, y-axis: velocity
Results on Continuous Problems
True Cost
Results on Continuous Problems
True Cost
Comparison to Other Algorithms
[Figure: optimal policy alongside the OptV estimate (2 min), the Ng ’00 estimate, and the Abbeel ’04 estimate]
Controls were discretized for the other algorithms to make them run tractably.
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Case studies
Inverse Optimal Control with LMDPs
Lecture outline
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Inverse RL with LMDPs vs Inverse RL with Traditional MDPs
Summary
Abbeel and Ng. Apprenticeship learning via inverse reinforcement learning. (ICML 2004)
Boyd, El Ghaoui, Feron and Balakrishnan. Linear matrix inequalities in system and control theory. (SIAM 1994)
Kolter, Abbeel and Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. (NIPS 2008)
Neu and Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. (UAI 2007)
Ng and Russell. Algorithms for inverse reinforcement learning. (ICML 2000)
Ratliff, Bagnell and Zinkevich. Maximum margin planning. (ICML 2006)
Ratliff, Bradley, Bagnell and Chestnutt. Boosting structured prediction for imitation learning. (NIPS 2007)
Ramachandran and Amir. Bayesian inverse reinforcement learning. (IJCAI 2007)
Syed and Schapire. A game-theoretic approach to apprenticeship learning. (NIPS 2008)
Ziebart, Maas, Bagnell and Dey. Maximum entropy inverse reinforcement learning. (AAAI 2008)
Dvijotham and Todorov. Inverse optimal control with linearly-solvable MDPs. (ICML 2010)
More materials / References