
Lifelong Learning for Disturbance Rejection on Mobile Robots

GRASP LABORATORY

David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin, Brandon Kallaher, Matthew E. Taylor


Motivation

Problem 1: Without prior knowledge, RL in a new task is slow

Idea: Reuse knowledge from previously learned tasks

[Figure: standard “tabula rasa” initialization vs. initialization via transfer]


[Figure: sequence of tasks along a Time axis, with the Current Task marked]

We focus on the lifelong learning case:
• Agent learns multiple tasks consecutively
• Want stability guarantees as the number of tasks grows large


Background


Background: Policy Gradient Methods for Control

• Agent interacts with environment, taking consecutive actions
• PG methods support continuous state and action spaces
  – Have shown recent success in applications to robotic control [Kober & Peters 2011; Peters & Schaal 2008; Sutton et al. 2000]
• Formalized as a Markov Decision Process (MDP)

[Figure: agent makes sequential decisions; labels: agent, reward function, probabilistic transition]




[Figure: n trajectories → Policy Gradient Learner → Policy]

Goal: find policy that minimizes

[Equation not recovered; its two factors were labeled "probability of trajectory" and "reward function"]
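For orientation, a policy gradient objective is commonly written as an expectation of a trajectory-level reward under the trajectory distribution induced by the policy. The form below is a sketch under that assumption, not a transcription of the slide. The symbols $\theta$ (policy parameters), $\tau$ (trajectory), $x_h$ and $a_h$ (state and action at step $h$), $H$ (horizon), $P_0$ (initial state distribution), and $\mathcal{R}$ are all assumed names. Because the slide speaks of minimizing, $\mathcal{R}$ would have to be read as a cost, or the expression negated.

```latex
J(\theta) = \int p_{\theta}(\tau)\, \mathcal{R}(\tau)\, d\tau,
\qquad
p_{\theta}(\tau) = P_0(x_0) \prod_{h=0}^{H-1}
    p(x_{h+1} \mid x_h, a_h)\, \pi_{\theta}(a_h \mid x_h)
```

Here $p_{\theta}(\tau)$ plays the role of the "probability of trajectory" label and $\mathcal{R}(\tau)$ the "reward function" label.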


Background: Finite Difference Policy Gradients


Approximate the change in reward with sampled disturbances




Use the pseudo-inverse to find the gradient





Update the current policy
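The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function `finite_difference_gradient`, the quadratic `toy_cost` standing in for rollouts on a robot, and the values of `sigma`, `n_perturbations`, and the step size are all assumptions chosen for the example.

```python
import numpy as np

def finite_difference_gradient(J, theta, n_perturbations=20, sigma=0.05, rng=None):
    """Estimate the gradient of J at theta from sampled parameter disturbances."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample small random disturbances of the policy parameters.
    delta_theta = sigma * rng.standard_normal((n_perturbations, theta.size))
    # Measure the change in the objective caused by each disturbance.
    J_ref = J(theta)
    delta_J = np.array([J(theta + d) - J_ref for d in delta_theta])
    # Solve delta_theta @ g ~= delta_J in the least-squares sense (pseudo-inverse).
    return np.linalg.pinv(delta_theta) @ delta_J

def toy_cost(theta):
    # Hypothetical stand-in for a rollout-based cost; minimum at (1, -2).
    return float(np.sum((theta - np.array([1.0, -2.0])) ** 2))

rng = np.random.default_rng(0)
g0 = finite_difference_gradient(toy_cost, np.zeros(2), rng=rng)

theta = np.zeros(2)
for _ in range(200):
    g = finite_difference_gradient(toy_cost, theta, rng=rng)
    theta = theta - 0.1 * g  # descend, since the objective is treated as a cost
```

On this toy cost the estimate `g0` should be close to the analytic gradient at the origin, and `theta` should settle near the minimizer. On a real robot each call to `J` would be a noisy rollout, so more disturbances per estimate would likely be needed.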

Lifelong PG Learning


Lifelong Machine Learning

Lifelong Learning System

[Figure: timeline of tasks ..., t-3, t-2, t-1, t, t+1, t+2, t+3, ... along a Time axis; labels: previously learned knowledge, previously learned tasks, current task, future learning tasks]

1.) Tasks are received consecutively


[Figure: same timeline, with the label "trajectories for task t" at the current task]

2.) Knowledge is transferred from previously learned tasks

[Figure: same timeline, with the label "learned policy" at the current task]

3.) New knowledge is stored for future use


4.) Existing knowledge is refined
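The four steps of the lifelong learning system can be illustrated with a deliberately simple loop. This is a sketch under strong assumptions, not PG-ELLA itself: shared knowledge is represented as a running average of learned policy parameters, and the class `LifelongLearner`, the helper `make_trainer`, and the quadratic toy tasks are hypothetical names introduced for the example.

```python
import numpy as np

def make_trainer(target, lr=0.1):
    """Hypothetical base learner: gradient descent on the cost ||theta - target||^2."""
    def train(theta, n_steps):
        for _ in range(n_steps):
            theta = theta - lr * 2.0 * (theta - target)
        return theta
    return train

class LifelongLearner:
    """Toy lifelong learner whose shared knowledge is the mean of learned policies."""

    def __init__(self, dim):
        self.knowledge = np.zeros(dim)  # shared knowledge (assumed representation)
        self.policies = {}              # learned policy per task
        self.n_tasks = 0

    def learn_task(self, task_id, train, n_steps=50):
        # 2.) Transfer: initialize from prior knowledge instead of tabula rasa.
        theta = train(self.knowledge.copy(), n_steps)
        # 3.) Store the new knowledge for future use.
        self.policies[task_id] = theta
        # 4.) Refine existing knowledge (incremental mean over tasks seen so far).
        self.n_tasks += 1
        self.knowledge += (theta - self.knowledge) / self.n_tasks
        return theta

learner = LifelongLearner(dim=2)
targets = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([0.9, 1.1])]
for task_id, target in enumerate(targets):  # 1.) tasks are received consecutively
    learner.learn_task(task_id, make_trainer(target))
```

After the loop, `learner.knowledge` sits near the mean of the task optima, so a fourth, similar task would start close to its solution. A richer knowledge representation would replace the incremental mean in step 4.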



PG-ELLA Objective

Issue: the objective is dependent on all trajectories

[Equation not recovered; one of its terms was labeled "Hessian"]
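One way to read the "Hessian" label is as a second-order approximation in which each task is summarized by a single-task policy and a Hessian, so that stored trajectories are no longer needed. The sketch below assumes that reading. The names `alpha` (single-task policy), `Gamma` (Hessian at `alpha`), `L` (shared basis), and `s` (task-specific coefficients) are assumptions, and the ridge penalty is a simplification swapped in for the sparsity penalty usually associated with PG-ELLA so that the solution has a closed form.

```python
import numpy as np

def fit_task_code(alpha, Gamma, L, ridge=1e-3):
    """Minimize (alpha - L s)^T Gamma (alpha - L s) + ridge * ||s||^2 over s."""
    A = L.T @ Gamma @ L + ridge * np.eye(L.shape[1])
    b = L.T @ Gamma @ alpha
    return np.linalg.solve(A, b)

def weighted_error(alpha, Gamma, L, s):
    """Hessian-weighted squared distance between alpha and the shared model L s."""
    r = alpha - L @ s
    return float(r @ Gamma @ r)

# Hypothetical 4-parameter policy expressed in a 2-column shared basis.
L = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
Gamma = np.diag([1.0, 2.0, 3.0, 4.0])  # stand-in for a positive definite Hessian
s_true = np.array([0.5, -1.5])
alpha = L @ s_true                     # single-task policy lying in the span of L

s = fit_task_code(alpha, Gamma, L)
err = weighted_error(alpha, Gamma, L, s)
```

Only `alpha` and `Gamma` are needed per task in this formulation, which is what makes the objective independent of the raw trajectories once a task has been learned.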


Verification on Robots

Experiments


Results for Robot Go-to-Goal Task


• Run RL on a new robot (goal and disturbance) for a small number of iterations
• Use PG-ELLA to adjust policy according to known solutions
• Continue training

PG-ELLA improves learning

Better Results Incorporating Prior


• Initializing with the average policy of the other robots increases the benefit

PG-ELLA improves learning


Thank you!

Questions?

This research was supported by ONR N00014-11-1-0139, AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.
