Lifelong Learning for Disturbance Rejection on Mobile Robots

GRASP LABORATORY

David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin, Brandon Kallaher, Matthew E. Taylor

Motivation

Problem 1: Without prior knowledge, RL in a new task is slow

Idea: Reuse knowledge from previously learned tasks

[Figure: standard "tabula rasa" initialization vs. initialization via transfer]

We focus on the lifelong learning case:
• Agent learns multiple tasks consecutively
• Want stability guarantees as the number of tasks grows large

[Figure: sequence of tasks along a Time axis, with the current task marked]

Background

Background: Policy Gradient Methods for Control

• Agent interacts with environment, taking consecutive actions
• PG methods support continuous state and action spaces
  – Have shown recent success in applications to robotic control [Kober & Peters 2011; Peters & Schaal 2008; Sutton et al. 2000]
• Formalized as a Markov Decision Process (MDP)

[Figure: agent makes sequential decisions; labels: agent, reward function, probabilistic transition]

Background: Policy Gradient Methods for Control

[Figure labels: n trajectories, Policy Gradient Learner, Policy]

Goal: find policy that minimizes

[Equation not recovered; its terms are labeled "probability of trajectory" and "reward function"]
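
The objective on this slide is missing from the transcript. A minimal sketch of what it likely expresses, assuming the standard policy gradient setup: a policy with parameters \theta induces a distribution p_\theta(\tau) over trajectories \tau (the "probability of trajectory"), and \mathcal{R}(\tau) is the reward collected along a trajectory (the "reward function"). The symbols \theta, \tau, \mathbb{T} and the negated-return sign convention, chosen to match the slide's "minimizes", are assumptions rather than content taken from the slide.

```latex
\mathcal{J}(\theta) = \int_{\mathbb{T}} p_\theta(\tau)\, \mathcal{R}(\tau)\, d\tau,
\qquad
\theta^\star = \arg\min_{\theta} \; -\mathcal{J}(\theta)
```

Under this reading, minimizing the negated expected return is the same as finding the policy with the highest expected reward over the trajectories it generates.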

Background: Finite Difference Policy Gradients

• Approximate the change in reward with sampled disturbances
• Use the pseudo-inverse to find the gradient
• Update the current policy
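
The three steps above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the policy is a flat parameter vector, `evaluate` is a hypothetical function returning the reward of a policy, and `quadratic_return` is a made-up stand-in for a robot rollout.

```python
import numpy as np

def finite_difference_gradient(evaluate, theta, n_perturbations=20, scale=0.1, rng=None):
    """Estimate the gradient of evaluate(theta) from random perturbations."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: sample small disturbances of the policy parameters and
    # record the change in reward each one causes.
    d_theta = rng.normal(scale=scale, size=(n_perturbations, theta.size))
    j_ref = evaluate(theta)
    d_j = np.array([evaluate(theta + d) - j_ref for d in d_theta])
    # Step 2: solve d_theta @ g ~= d_j in the least-squares sense
    # using the pseudo-inverse.
    return np.linalg.pinv(d_theta) @ d_j

def quadratic_return(theta):
    """Hypothetical reward: highest when theta equals the target."""
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(200):
    # Step 3: update the current policy along the estimated gradient.
    theta = theta + 0.1 * finite_difference_gradient(quadratic_return, theta, rng=rng)
```

With this toy reward, `theta` should end close to the target `[1.0, -2.0]`; on a real robot each call to `evaluate` would be one or more rollouts, so the number of perturbations is a direct cost.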

Lifelong PG Learning

Lifelong Machine Learning

1.) Tasks are received consecutively
2.) Knowledge is transferred from previously learned tasks
3.) New knowledge is stored for future use
4.) Existing knowledge is refined

[Figure: Lifelong Learning System. Tasks ..., t-3, t-2, t-1, t, t+1, t+2, t+3, ... along a Time axis, grouped into previously learned tasks, the current task, and future learning tasks; labels: trajectories for task t, previously learned knowledge, learned policy]

PG-ELLA Objective

Issue: the objective is dependent on all trajectories

[Equations not recovered; one term is labeled "Hessian"]
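
The equations on this slide did not survive extraction. A hedged reconstruction, assuming the formulation commonly given for PG-ELLA rather than anything recoverable from this transcript: each task's policy parameters are written as \theta^{(t)} = L s^{(t)}, with a basis L shared across tasks and sparse task-specific coefficients s^{(t)}. The first expression must revisit every task's trajectories to evaluate \mathcal{J}. The second replaces that term with a second-order expansion around each task's single-task solution \alpha^{(t)}, so a task is summarized by \alpha^{(t)} and its Hessian \Gamma^{(t)}. All symbols here (L, s^{(t)}, \alpha^{(t)}, \Gamma^{(t)}, \mu, \lambda, T) are assumptions.

```latex
e_T(L) = \frac{1}{T} \sum_{t=1}^{T} \min_{s^{(t)}}
  \Big[ -\mathcal{J}\big(\theta^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \Big]
  + \lambda \| L \|_F^2,
\qquad \theta^{(t)} = L s^{(t)}

\hat{e}_T(L) = \frac{1}{T} \sum_{t=1}^{T} \min_{s^{(t)}}
  \Big[ \big\| \alpha^{(t)} - L s^{(t)} \big\|_{\Gamma^{(t)}}^2 + \mu \big\| s^{(t)} \big\|_1 \Big]
  + \lambda \| L \|_F^2,
\qquad \| v \|_A^2 = v^\top A v
```

Under this reading, the Hessian weights the reconstruction error, so that directions in parameter space that matter more for a task's reward are matched more closely.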

Verification on Robots

Experiments

Results for Robot Go-to-Goal Task

• Run RL on a new robot (goal and disturbance) for a small number of iterations
• Use PG-ELLA to adjust policy according to known solutions
• Continue training

PG-ELLA improves learning
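
One way to picture the "adjust policy according to known solutions" step is as a sparse projection of the partially trained policy onto a basis learned from earlier robots. The sketch below is a simplification under stated assumptions: the Hessian weighting is replaced by the identity, the solver is plain ISTA (a swapped-in technique, not necessarily what the authors used), and `basis`, `warm_start`, and `mu` are hypothetical names and values.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink each entry of v toward zero by thresh."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def project_onto_basis(basis, warm_start, mu=0.1, n_iters=500):
    """Minimize 0.5 * ||warm_start - basis @ s||^2 + mu * ||s||_1 with ISTA."""
    # Step size from the Lipschitz constant of the smooth term's gradient.
    lipschitz = np.linalg.norm(basis, 2) ** 2
    s = np.zeros(basis.shape[1])
    for _ in range(n_iters):
        grad = basis.T @ (basis @ s - warm_start)
        s = soft_threshold(s - grad / lipschitz, mu / lipschitz)
    # The adjusted policy is the basis combination selected by s.
    return s, basis @ s

# Hypothetical shared basis (two components in a 3-parameter policy space)
# and a policy obtained after a few RL iterations on the new robot.
basis = np.eye(3)[:, :2]
warm_start = np.array([1.0, 0.05, 0.3])
codes, adjusted_policy = project_onto_basis(basis, warm_start)
```

Here `codes` comes out as `[0.9, 0.0]` and `adjusted_policy` as `[0.9, 0.0, 0.0]`: the small second coefficient is zeroed by the sparsity penalty, and the component outside the basis is dropped. Training would then continue from `adjusted_policy`.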

Better Results Incorporating Prior

• Initializing with the average policy of the other robots increases the benefit

PG-ELLA improves learning
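
A minimal sketch of average-policy initialization, assuming each robot's policy is a parameter vector of the same length and that the average is taken element-wise. The function name and the example vectors are hypothetical.

```python
import numpy as np

def average_policy(policies):
    """Element-wise mean of a list of equally sized policy parameter vectors."""
    return np.mean(np.stack(policies), axis=0)

# Hypothetical policies already learned on three other robots.
other_robot_policies = [
    np.array([1.0, 2.0]),
    np.array([3.0, 0.0]),
    np.array([2.0, 1.0]),
]

# Starting point for the new robot, used in place of a random or zero policy.
theta_init = average_policy(other_robot_policies)
```

For these example vectors `theta_init` is `[2.0, 1.0]`.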

Thank you!

Questions?

This research was supported by ONR N00014-11-1-0139, AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.