# Recent Advances in Hierarchical Reinforcement Learning

Post on 22-Mar-2018


PIGML Seminar - AirLab

Recent Advances in Hierarchical Reinforcement Learning

Authors: Andrew Barto, Sridhar Mahadevan

Speaker: Alessandro Lazaric

Outline

- Introduction to Reinforcement Learning
  - Reinforcement Learning Inspirations and Foundations
  - Markov Decision Processes (MDPs) and Q-learning
- Hierarchical Reinforcement Learning
  - From MDPs to SMDPs
  - Option Framework
  - MAXQ Value Function Decomposition
  - Other Approaches to Hierarchical Reinforcement Learning
- Future/Current/Past Research

RL as Animal Psychology

Of several responses [actions] made to the same situation, those which are followed by satisfaction to the animal will be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are followed by discomfort to the animal will have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

RL as Neuroscience

Much evidence suggests that dopamine cells play an important role in reinforcement and action learning

Electrophysiological studies support a theory that dopamine cells signal a global prediction error for summed future reinforcement in appetitive conditioning tasks, in the form of a temporal difference (TD) prediction error term

[Figure: reinforcement signal R]

Kakade & Dayan (2002)

RL as Artificial Intelligence

An artificial agent (either software or hardware) is placed in an environment

The agent
- perceives the state of the environment
- acts on the environment through actions
- has a goal (planning)
- receives rewards from a critic

States S, Actions A, Reward R(s,a)

[Figure: diagram with Environment, Agent, and Critic blocks, connected by States, Actions, and Reward arrows]

RL as Optimal Control

A control system has sensors (i.e., states), actuators (i.e., actions), and costs (i.e., rewards)

The environment is a stochastic dynamical system

Often, the system can be formalized as a Markov Decision Process

Optimal control

RL as Discrete Time Differential Equations

Value function

Action value function

Bellman equations

Bellman (1957a)
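
The equations for this slide did not survive extraction. As a sketch under assumed standard notation (a policy \pi, a transition model P, the reward R(s,a) used elsewhere in the talk, a discount factor \gamma, and r_t = R(s_t, a_t); none of these symbols is confirmed by the slide itself), the value function, the action value function, and the corresponding Bellman equations are usually written as:

```latex
\begin{align}
V^{\pi}(s)   &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_0 = s, \pi\Big] \\
Q^{\pi}(s,a) &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_0 = s, a_0 = a, \pi\Big] \\
V^{\pi}(s)   &= \sum_{a} \pi(a \mid s)\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s')\Big] \\
Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a')
\end{align}
```

The last two lines are the recursive (Bellman) form: the value of a state is the immediate reward plus the discounted value of the successor state, averaged over the policy and the transition model.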

RL as Operations Research

Optimal functions

Dynamic Programming (given P and R)

Bellman (1957b)
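
Dynamic programming assumes that P and R are known. The following is a minimal sketch of value iteration, one common dynamic programming method, and not necessarily the exact algorithm shown on the slide. The data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as the expected reward), the discount `gamma`, and the three-state chain are all illustrative assumptions.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return (V, policy) for a finite MDP with known model.

    P[s][a] is a list of (probability, next_state) pairs (assumed layout).
    R[s][a] is the expected immediate reward for action a in state s.
    """
    n_states = len(P)

    def backup(s, a, V):
        # One-step lookahead: immediate reward plus discounted successor value.
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(backup(s, a, V) for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged value function.
    policy = [
        max(range(len(P[s])), key=lambda a: backup(s, a, V))
        for s in range(n_states)
    ]
    return V, policy


# Hypothetical 3-state chain: action 0 stays, action 1 moves right.
# State 2 is absorbing; stepping from state 1 into state 2 pays reward 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 2)]],
    [[(1.0, 2)], [(1.0, 2)]],
]
R = [
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
V, policy = value_iteration(P, R)
```

In this toy chain the greedy policy moves right from states 0 and 1, since that is the only way to collect the reward.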

RL as a Milkshake

- Operations Research
- Bellman Equations
- Animal Psychology
- Optimal Control
- Neuroscience

RL as a Machine Learning Paradigm!

Reinforcement Learning is the most general Machine Learning paradigm

RL is learning how to map states to actions, so as to maximize a numerical reward in the long run

RL is a multi-step decision-making process (often Markovian)

An RL agent learns through a model-free trial-and-error process

Actions may affect not only the immediate reward but also subsequent rewards (delayed effect)

Reinforcement Learning Framework

Markov Decision Process (MDP)
- Set of states S
- Set of actions A
- Transition model P
- Reward function R(s,a)
- Discount factor

Solution of an MDP
- Optimal (action) value function
- Optimal policy

[Figure: 5x5 grid world with states numbered 0 to 24, row by row]

Reinforcement Learning: Q-learning

Q-learning
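
The update rule on this slide was lost in extraction. As a hedged illustration, here is a minimal sketch of tabular Q-learning with the standard update Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]. The step size `alpha`, discount `gamma`, exploration rate `epsilon`, the `step` interface, and the five-state corridor task are all assumptions made for the example, not details taken from the talk.

```python
import random


def q_learning(step, n_states, n_actions, start, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning. step(s, a) must return (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)          # explore
            else:
                best = max(Q[s])                      # exploit, ties broken at random
                a = rng.choice([b for b in range(n_actions) if Q[s][b] == best])
            s2, r, done = step(s, a)
            # Model-free target: no transition model P is needed, only the sample.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


# Hypothetical corridor with states 0..4: action 0 = left, 1 = right.
# Reaching state 4 ends the episode with reward 1.
def corridor_step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4


Q = q_learning(corridor_step, n_states=5, n_actions=2, start=0)
greedy = [max(range(2), key=lambda b: Q[s][b]) for s in range(4)]
```

After training, the greedy action in every non-terminal state of this corridor should be "right", the shortest path to the reward.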

An Example of Reinforcement Learning

http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html

The need for Hierarchical RL

Curse of dimensionality: applying Reinforcement Learning to problems with a large action and/or state space is infeasible

Abstraction: state and temporal abstractions make it possible to simplify the problem

Prior knowledge: complex tasks can often be decomposed into a hierarchy of sub-tasks

Solution: sub-tasks can be effectively solved by Reinforcement Learning approaches

Reuse: sub-tasks and abstract actions can be used in different tasks on the same domain

Hierarchical Reinforcement Learning

The hierarchical approach to RL introduces temporal abstraction into the Reinforcement Learning framework

Temporal abstraction goes by many names:
- Macro-operators
- Temporally extended actions
- Options
- Sub-tasks
- Skills
- Behaviors
- Modes

Hierarchical Reinforcement Learning

From MDPs to SMDPs: with temporally extended actions we need to take into account the amount of time passed between decision time instants

Semi-Markov Decision Processes

Hierarchical RL Approaches

Options Framework

MAXQ Value Function Decomposition

Hierarchies of Abstract Machines

Options Framework

An option o is defined as:
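
The definition itself did not survive extraction. In the options formulation of Sutton, Precup and Singh, an option is a triple consisting of an initiation set I (states where the option may start), a policy π (which action to take while the option runs), and a termination condition β (the probability of stopping in each state). A minimal sketch of that triple as a data structure follows; the class name, field names, and the corridor example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass
class Option:
    initiation_set: FrozenSet[int]        # I: states where the option can be started
    policy: Callable[[int], int]          # pi: state -> primitive action
    termination: Callable[[int], float]   # beta: state -> probability of terminating

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set


# Hypothetical corridor with states 0..4 and actions 0 = left, 1 = right.
RIGHT = 1

# A temporally extended option: keep moving right until state 4 is reached.
go_to_end = Option(
    initiation_set=frozenset({0, 1, 2, 3}),
    policy=lambda s: RIGHT,
    termination=lambda s: 1.0 if s == 4 else 0.0,
)


def primitive(action: int, states: FrozenSet[int]) -> Option:
    """Wrap a primitive action as a one-step option (it always terminates)."""
    return Option(states, lambda s: action, lambda s: 1.0)


step_right = primitive(RIGHT, frozenset(range(5)))
```

Treating primitive actions as one-step options lets a single policy choose among primitive and temporally extended actions in a uniform way.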

Options Framework

Between MDPs and SMDPs

- MDP: discrete time, homogeneous discount
- SMDP: continuous time, discrete events, interval-dependent discount
- Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

[Figure: state trajectories over time (axes: State, Time) for the three cases above]

Sutton (1999)

Options Framework

The introduction of options leads to a straightforward redefinition of all the elements

Option reward:

Option transition model:

(Hierarchical) Policy over options:
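
The three formulas were lost in extraction. As a sketch, assuming the usual definitions from the options literature (with a discount factor \gamma, r_{t+i} the reward of the i-th primitive step after the option starts, k the random duration of the option, p(s',k | s,o) the probability that option o started in s terminates in s' after k steps, and O the set of options; the exact notation on the slide is unknown), they can be written as:

```latex
\begin{align}
R(s,o) &= \mathbb{E}\Big[\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} \,\Big|\, o \text{ initiated in } s \text{ at time } t\Big] \\
P(s' \mid s,o) &= \sum_{k=1}^{\infty} \gamma^{k}\, p(s',k \mid s,o) \\
\mu &: S \times O \to [0,1]
\end{align}
```

Note that the option transition model folds the discount into the probabilities, so a long-running option contributes less to the value of its successor state than a short one.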

Options Framework

Value Function

Action Value Function

SMDP Q-learning
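
The equations for this slide are missing from the transcript. A minimal sketch of the SMDP Q-learning update follows, assuming the commonly used form Q(s,o) ← Q(s,o) + α [r + γ^k max_o' Q(s',o') − Q(s,o)], where r is the discounted reward accumulated while option o ran for k steps. The function name, the table layout `Q[state][option]`, and the default `alpha` and `gamma` are illustrative assumptions.

```python
def smdp_q_update(Q, s, o, rewards, s_next, alpha=0.1, gamma=0.9):
    """Apply one SMDP Q-learning update after option o has terminated.

    Q        table indexed as Q[state][option] (assumed layout)
    s        state in which the option was started
    rewards  primitive rewards collected while the option ran, in order
    s_next   state in which the option terminated
    """
    k = len(rewards)  # duration of the option in primitive steps
    # Discounted return accumulated during the option's execution.
    r = sum(gamma ** i * r_i for i, r_i in enumerate(rewards))
    # The bootstrap term is discounted by gamma**k, not by a single gamma.
    target = r + gamma ** k * max(Q[s_next])
    Q[s][o] += alpha * (target - Q[s][o])
    return Q[s][o]


# Hypothetical usage: option 1 ran for three steps from state 0 to state 1.
Q = [[0.0, 0.0], [1.0, 2.0]]
new_value = smdp_q_update(Q, s=0, o=1, rewards=[0.0, 0.0, 1.0], s_next=1,
                          alpha=0.5, gamma=0.5)
```

The only difference from one-step Q-learning is the interval-dependent discount: with k = 1 the update reduces to the ordinary rule.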

Options Framework

Option optimizations
- Intra-option learning: after each primitive action, update all the options that could have taken that action

[Figure: Option 1 and Option 2 sharing an intra-option update]
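
One way to realize intra-option learning is the intra-option Q-learning rule for options with deterministic policies: every option whose policy agrees with the action just taken is updated toward r + γ U(s',o), where U(s',o) = (1 − β(s')) Q(s',o) + β(s') max_o' Q(s',o'). The sketch below assumes that rule; the representation of an option as a `(policy, beta)` pair and the table layout are hypothetical.

```python
def intra_option_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update all options consistent with taking primitive action a in state s.

    options[o] is a (policy, beta) pair (assumed representation):
      policy(state) -> primitive action, beta(state) -> termination probability.
    Returns the indices of the options that were updated.
    """
    v_next = max(Q[s_next])  # value of re-selecting the best option in s_next
    updated = []
    for o, (policy, beta) in enumerate(options):
        if policy(s) != a:
            continue  # this option would not have taken action a here
        b = beta(s_next)
        # Continue with o (probability 1 - b) or terminate and pick the best option.
        u = (1.0 - b) * Q[s_next][o] + b * v_next
        Q[s][o] += alpha * (r + gamma * u - Q[s][o])
        updated.append(o)
    return updated


# Hypothetical setup: options 0 and 1 both choose action 1, option 2 chooses action 0.
options = [
    (lambda s: 1, lambda s: 0.0),   # never terminates
    (lambda s: 1, lambda s: 1.0),   # always terminates
    (lambda s: 0, lambda s: 1.0),
]
Q = [[0.0, 0.0, 0.0], [1.0, 2.0, 4.0]]
updated = intra_option_update(Q, options, s=0, a=1, r=1.0, s_next=1,
                              alpha=0.5, gamma=0.5)
```

A single primitive transition thus improves the estimates of several options at once, including options that were not the one being executed.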

Options Framework

Option optimizations
- Termination improvement: interrupt the execution of an option o whenever there is another option o' whose expected reward is greater

[Figure: landmark navigation task from start S to goal G, showing the landmarks and the range (input set) of each run-to-landmark controller. SMDP solution: 600 steps. Termination-improved solution: 474 steps.]
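
The interruption rule can be sketched as a check performed at every step of a running option. This is a minimal illustration that assumes option values are stored in a table `Q[state][option]` and that "expected reward" means the option value Q(s,o); both are assumptions, as is the function name.

```python
def interrupt_or_continue(Q, s, o):
    """Return the option to follow in state s while option o is executing.

    The running option o is interrupted only if some other option has a
    strictly greater value in s; on ties the current option continues.
    """
    best = max(range(len(Q[s])), key=lambda j: Q[s][j])
    return best if Q[s][best] > Q[s][o] else o


# Hypothetical values for two states and two options.
Q = [[1.0, 2.0], [3.0, 3.0]]
switch = interrupt_or_continue(Q, s=0, o=0)   # option 1 is better in state 0
keep = interrupt_or_continue(Q, s=1, o=1)     # tie in state 1, keep option 1
```

Interrupting lets the agent leave a landmark controller early when a better one becomes available, which is consistent with the shorter path reported for the termination-improved solution.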

Options Framework

Pros
- Options are very simple to implement
- Options are effective in defining high-level skills
- Options improve the speed of convergence
- Options can be used to define hierarchies of o