Apprenticeship Learning for Robotic Control
Pieter Abbeel, Stanford University
Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
Motivation for apprenticeship learning
Outline
Preliminary: reinforcement learning.
Apprenticeship learning algorithms.
Experimental results on various robotic platforms.
Reinforcement learning (RL)
[Diagram: starting from state s0, the system dynamics Psa map each state st and action at to the next state st+1, for t = 0, …, T−1, accumulating reward R(s0) + R(s1) + … + R(sT).]
Example reward function: R(s) = - || s – s* ||
Goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)]
Solution: a policy which specifies an action for each possible state, for all times t = 0, 1, …, T.
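As a concrete illustration of this setup, here is a minimal sketch (a hypothetical 3-state, 2-action MDP with random dynamics, not from the talk) that computes such a policy by backward induction, using the example reward R(s) = −‖s − s*‖:

```python
import numpy as np

# Hypothetical 3-state, 2-action finite-horizon MDP (illustrative only):
# dynamic programming to find a policy maximizing E[R(s0) + ... + R(sT)].
n_states, n_actions, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = Psa
s_star = 2
R = -np.abs(np.arange(n_states) - s_star).astype(float)  # R(s) = -||s - s*||

V = R.copy()                      # value at the final time step: V_T(s) = R(s)
policy = np.zeros((T, n_states), dtype=int)
for t in range(T - 1, -1, -1):    # backward induction over time
    Q = R[:, None] + P @ V        # Q[s, a] = R(s) + sum_s' Psa(s') * V(s')
    policy[t] = Q.argmax(axis=1)  # best action for each state at time t
    V = Q.max(axis=1)

print(V)  # expected total reward from each start state under the policy
```

The `policy` array is exactly the object the slide describes: one action per state, per time step.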
Model-based reinforcement learning
Run the RL algorithm in the simulator to obtain a control policy.
Apprenticeship learning algorithms use a demonstration to help us find
a good dynamics model,
a good reward function,
a good control policy.
Reinforcement learning (RL)
[Diagram: the Dynamics Model Psa and the Reward Function R feed into Reinforcement Learning, which solves maxπ E[R(s0) + … + R(sT) | π] to produce a control policy.]
Apprenticeship learning for the dynamics model
[Diagram: the same pipeline, with the Dynamics Model Psa highlighted: Psa and the Reward Function R feed into Reinforcement Learning, maxπ E[R(s0) + … + R(sT) | π], yielding a control policy.]
Motivating example: obtaining an accurate dynamics model Psa
Option 1: textbook model / specification.
Option 2: collect flight data and learn the model from data.
How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
Learning the dynamical model
State-of-the-art: E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
[Flowchart: Have a good model of the dynamics? NO → “Explore”; YES → “Exploit”.]
Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?
[ICML 2005]
Apprenticeship learning of the model
[Diagram: iterate between teacher and learner. A human pilot flight produces data (a1, s1, a2, s2, …) → learn Psa → Reinforcement Learning, maxπ E[R(s0) + … + R(sT) | π], with reward R yields a control policy → autonomous flight produces more data (a1, s1, a2, s2, …) → learn Psa again, and repeat.]
No explicit exploration: always try to fly as well as possible.
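This loop can be sketched on a toy one-dimensional linear system (hypothetical numbers; a noisy stand-in controller plays the human pilot, and a simple model-based controller stands in for the RL step):

```python
import numpy as np

# Toy sketch of the apprenticeship model-learning loop: fit the dynamics
# from the teacher's flight, plan with the learned model, fly autonomously,
# and refit on the new data -- with no explicit exploration step anywhere.
rng = np.random.default_rng(1)
a_true, b_true = 0.9, 0.5            # unknown true dynamics: s' = a*s + b*u

def rollout(policy, s0=5.0, T=40):
    """Run the true system for T steps; return state and action arrays."""
    s, S, U = s0, [], []
    for _ in range(T):
        u = policy(s)
        S.append(s); U.append(u)
        s = a_true * s + b_true * u + 0.01 * rng.standard_normal()
    return np.array(S), np.array(U)

def fit_model(S, U):
    """Least-squares fit of (a, b) in s_{t+1} = a*s_t + b*u_t."""
    X = np.column_stack([S[:-1], U[:-1]])
    coef, *_ = np.linalg.lstsq(X, S[1:], rcond=None)
    return coef

teacher = lambda s: -1.5 * s + 0.3 * rng.standard_normal()  # stand-in pilot
S, U = rollout(teacher)              # teacher demonstration
for _ in range(3):                   # learn model -> plan -> autonomous flight
    a_hat, b_hat = fit_model(S, U)
    policy = lambda s, a=a_hat, b=b_hat: -(a / b) * s   # drive s toward s* = 0
    S2, U2 = rollout(policy)         # always try to fly as well as possible
    S, U = np.concatenate([S, S2]), np.concatenate([U, U2])

print(a_hat, b_hat)                  # should land near the true (0.9, 0.5)
```

The teacher's trajectory covers the part of the state space the task actually visits, which is why no separate exploration phase is needed in this sketch.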
Theorem. Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability 1 − δ,
E[sum of rewards | policy returned by algorithm] ≥ E[sum of rewards | teacher’s policy] − ε.
Here, polynomial is with respect to 1/ε, 1/δ, the horizon T, the maximum reward R, and the size of the state space.
Learning the dynamics model
Details of the algorithm for learning the dynamics model:
Exploiting structure from physics.
Lagged learning criterion.
[NIPS 2005, 2006]
Helicopter flight results
First high-speed autonomous funnels. Speed: 5 m/s. Nominal pitch angle: 30 degrees.
Autonomous nose-in funnel
Accuracy
Autonomous tail-in funnel
Key points
Unlike exploration methods, our algorithm concentrates on the task of interest.
Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.
Apprenticeship learning: reward
[Diagram: the same pipeline, with the Reward Function R highlighted: R and the Dynamics Model Psa feed into Reinforcement Learning, maxπ E[R(s0) + … + R(sT) | π], yielding a control policy.]
Example task: driving
Related work
Previous work: learn to predict the teacher’s actions as a function of states. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; … Assumes “policy simplicity.”
Our approach: assumes “reward simplicity” and is based on inverse reinforcement learning (Ng & Russell, 2000). Similar work since: Ratliff et al., 2006, 2007.
Inverse reinforcement learning
Find R s.t. R is consistent with the teacher’s policy π* being optimal.
Find R s.t.: E[R(s0) + … + R(sT) | π*] ≥ E[R(s0) + … + R(sT) | π] for all policies π.
With R(s) = wᵀφ(s), find w s.t.: wᵀμ(π*) ≥ wᵀμ(π) for all policies π, where μ(π) = E[Σt φ(st) | π] are the feature expectations.
Linear constraints in w, quadratic objective → QP. Very large number of constraints.
Algorithm
For i = 1, 2, …
Inverse RL step: solve the QP for w, with one constraint wᵀμ(π*) ≥ wᵀμ(πj) per previously generated policy πj, j < i.
RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw.
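A compact sketch of this alternation, using the projection form of the inverse-RL step; the RL step is a stand-in here, an argmax over a small enumerated set of candidate policies, each summarized only by a made-up feature-expectation vector μ(π):

```python
import numpy as np

# Sketch of the inverse-RL / RL alternation (projection form). The RL step
# is a stand-in: an argmax over four enumerated candidate policies, each
# represented only by its feature-expectation vector mu (hypothetical numbers).
mu_E = np.array([0.8, 0.1])                # teacher's feature expectations
candidates = np.array([[0.9, 0.9], [0.1, 0.0], [0.7, 0.2], [0.5, 0.5]])

def rl_step(w):
    """Best candidate policy for the estimated reward R_w(s) = w . phi(s)."""
    return candidates[np.argmax(candidates @ w)]

mu_bar = candidates[0]                     # start from an arbitrary policy
for _ in range(50):
    w = mu_E - mu_bar                      # inverse-RL step (projection form)
    mu = rl_step(w)                        # RL step: constraint generation
    d = mu - mu_bar
    if d @ d < 1e-12:                      # RL step found nothing new: done
        break
    # project mu_E onto the segment between mu_bar and the new mu
    lam = np.clip(d @ (mu_E - mu_bar) / (d @ d), 0.0, 1.0)
    mu_bar = mu_bar + lam * d

print(np.linalg.norm(mu_E - mu_bar))       # remaining gap to the teacher's mu
```

Each new policy adds one constraint (one candidate μ), which is why the very large QP never has to be written down in full.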
Theoretical guarantees [ICML 2004]
Theorem. After at most nT²/ε² iterations, our algorithm returns a policy that performs as well as the teacher according to the teacher’s unknown reward function, i.e., E[R*(s0) + … + R*(sT) | π] ≥ E[R*(s0) + … + R*(sT) | π*] − ε.
Note: our algorithm does not necessarily recover the teacher’s reward function R*, which is in general impossible to recover.
Performance guarantee intuition
Intuition by example: let R(s) = w1 φ1(s) + w2 φ2(s).
If the returned policy π satisfies E[Σt φ1(st) | π] = E[Σt φ1(st) | π*] and E[Σt φ2(st) | π] = E[Σt φ2(st) | π*],
then no matter what the values of w1 and w2 are, the policy performs as well as the teacher’s policy π*.
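Written out (a reconstruction in the notation of the ICML 2004 setting, where R(s) = wᵀφ(s) with ‖w‖₁ ≤ 1), the argument is:

```latex
% Feature expectations of a policy, and a linear reward in the features:
\mu(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{T}\phi(s_t)\,\middle|\,\pi\right],
\qquad R(s) = w^\top\phi(s), \quad \|w\|_1 \le 1.
% Performance difference between the returned policy and the teacher:
\left|\,\mathbb{E}\!\left[\sum_t R(s_t)\,\middle|\,\pi\right]
     - \mathbb{E}\!\left[\sum_t R(s_t)\,\middle|\,\pi^*\right]\right|
 = \left|w^\top\bigl(\mu(\pi)-\mu(\pi^*)\bigr)\right|
 \le \bigl\|\mu(\pi)-\mu(\pi^*)\bigr\|_\infty .
% So if the feature expectations match, the returned policy performs as
% well as the teacher for every reward of this form, whatever w is.
```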
Case study: highway driving
Input: driving demonstration. Output: learned behavior.
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Helicopter
[Diagram: the Reward Function R and the Dynamics Model Psa feed into Reinforcement Learning, maxπ E[R(s0) + … + R(sT) | π], yielding a control policy.]
Differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989]
25 features
[NIPS 2007]
Autonomous aerobatics [helicopter video shown in talk].
Quadruped
Quadruped
Reward function trades off:
Height differential of terrain.
Gradient of terrain around each foot.
Height differential between feet.
… (25 features total for our setup)
Teacher demonstration for quadruped
Full teacher demonstration = sequence of footsteps.
Much simpler to “teach hierarchically”: specify a body path, and specify the best footstep in a small area.
Hierarchical inverse RL
Quadratic programming problem (QP): quadratic objective, linear constraints.
Constraint generation for path constraints.
Experimental setup
Training:
Have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
Around each foot placement: label the best foot placement (about 20 labels).
Label the best body path for the training board.
Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
Test on hold-out terrains: plan a path across the test board.
Quadruped on test-board
[Quadruped video shown in talk.]
Apprenticeship learning: RL algorithm
[Diagram: the RL pipeline with its inputs annotated: a (sloppy) demonstration, a (crude) dynamics model Psa, a reward function R, and a small number of real-life trials feed into Reinforcement Learning, maxπ E[R(s0) + … + R(sT) | π], yielding a control policy.]
Experiments
Two systems: RC car; fixed-wing flight simulator.
Control actions: throttle and steering.
RC Car: Circle
RC Car: Figure-8 Maneuver
Conclusion
Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
Our current work exploits teacher demonstrations to find
a good dynamics model,
a good reward function,
a good control policy.
Acknowledgments
J. Zico Kolter, Andrew Y. Ng
Morgan Quigley, Andrew Y. Ng
Andrew Y. Ng
Adam Coates, Morgan Quigley, Andrew Y. Ng