
Post on 18-Oct-2020


  • PIGML Seminar - AirLab

    Recent Advances in Hierarchical Reinforcement Learning

    Authors: Andrew Barto, Sridhar Mahadevan

    Speaker: Alessandro Lazaric

  • PIGML Seminar - AirLab

    Outline

    ▪ Introduction to Reinforcement Learning
      • Reinforcement Learning Inspirations and Foundations
      • Markov Decision Processes (MDPs) and Q-learning

    ▪ Hierarchical Reinforcement Learning
      • From MDPs to SMDPs
      • Option Framework
      • MAXQ Value Function Decomposition
      • Other Approaches to Hierarchical Reinforcement Learning
      • Future/Current/Past Research

  • PIGML Seminar - AirLab

    RL as… Animal Psychology

    ▪ "Of several responses [actions] made to the same situation, those which are followed by satisfaction to the animal will be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are followed by discomfort to the animal will have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond." (Thorndike, 1911, p. 244)

  • PIGML Seminar - AirLab

    RL as… Neuroscience

    ▪ Much evidence suggests that dopamine cells play an important role in reinforcement and action learning

    ▪ Electrophysiological studies support a theory that dopamine cells signal a global prediction error for summed future reinforcement in appetitive conditioning tasks, in the form of a temporal difference (TD) prediction error term

    ▪ Reinforcement Signal R

    Kakade & Dayan (2002)

  • PIGML Seminar - AirLab

    RL as… Artificial Intelligence

    ▪ An artificial agent (either software or hardware) is placed in an environment

    ▪ The agent
      • perceives the state of the environment
      • acts on the environment through actions
      • has a goal (planning)
      • receives rewards from a critic

    ▪ States S, Actions A, Reward R(s,a)

    [Figure: agent-environment loop with blocks Environment, Agent, and Critic, and arrows labeled States, Actions, and Reward]

  • PIGML Seminar - AirLab

    RL as… Optimal Control

    ▪ A control system has sensors (i.e., states), actuators (i.e., actions) and costs (i.e., rewards)

    ▪ The environment is a stochastic dynamical system

    ▪ Often, the system can be formalized as a Markov Decision Process

    ▪ Optimal control

  • PIGML Seminar - AirLab

    RL as… Discrete Time Differential Equations

    ▪ Value function

    ▪ Action value function

    ▪ Bellman equations

    Bellman (1957a)
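
The equations on this slide did not survive extraction. As a reference point, the three items it lists are usually written as follows. This is the standard textbook form, not necessarily the exact notation of the slide: the policy π(a | s) and the transition model P(s' | s, a) are assumed symbols, while R(s,a) and γ follow the notation used elsewhere in the talk.

```latex
\begin{align}
% Value function of a policy \pi
V^{\pi}(s) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big] \\
% Action value function of a policy \pi
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big] \\
% Bellman equations (recursive form of the two definitions above)
V^{\pi}(s) &= \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s') \Big] \\
Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a')
\end{align}
```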

  • PIGML Seminar - AirLab

    RL as… Operations Research

    ▪ Optimal functions

    ▪ Dynamic Programming (given P and R)

    Bellman (1957b)
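
When P and R are known, dynamic programming can compute the optimal value function directly. The following is a minimal value-iteration sketch, assuming a small tabular MDP stored as nested lists (`P[s][a][s2]` for transition probabilities, `R[s][a]` for rewards). The function name, the data layout, and the two-state example are illustrative assumptions, not taken from the talk.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a tabular MDP (illustrative sketch).

    P[s][a][s2] is the probability of reaching s2 from s under action a.
    R[s][a] is the immediate reward for taking action a in state s.
    Returns the (approximately) optimal values and a greedy policy.
    """
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states

    def q_value(s, a):
        # One-step lookahead: R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)
        return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))

    for _ in range(max_iter):
        # Bellman optimality backup applied to every state
        V_new = [max(q_value(s, a) for a in range(n_actions)) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    # Greedy policy with respect to the final values
    policy = [max(range(n_actions), key=lambda a: q_value(s, a)) for s in range(n_states)]
    return V, policy


# Hypothetical two-state example: in state 0, action 1 moves to the absorbing
# state 1 and pays reward 1; every other transition pays 0.
P_example = [[[1.0, 0.0], [0.0, 1.0]],
             [[0.0, 1.0], [0.0, 1.0]]]
R_example = [[0.0, 1.0],
             [0.0, 0.0]]
V_example, policy_example = value_iteration(P_example, R_example)
```

In the example, the greedy policy picks action 1 in state 0, since collecting the reward immediately is worth more than waiting under a discount factor below 1.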

  • PIGML Seminar - AirLab

    RL as… a Milkshake

    Operations Research

    Bellman Equations

    Animal Psychology

    Optimal Control

    Neuroscience

  • PIGML Seminar - AirLab

    RL as… a Machine Learning Paradigm!

    ▪ Reinforcement Learning is the most general Machine Learning paradigm

    ▪ RL is learning how to map states to actions so as to maximize a numerical reward in the long run

    ▪ RL is a multi-step decision-making process (often Markovian)

    ▪ An RL agent learns through a model-free trial-and-error process

    ▪ Actions may affect not only the immediate reward but also subsequent rewards (delayed effect)

  • PIGML Seminar - AirLab

    Reinforcement Learning Framework

    ▪ Markov Decision Process (MDP)
      • Set of states
      • Set of actions
      • Transition model
      • Reward function
      • Discount factor: γ

    ▪ Solution of an MDP
      • Optimal (action) value function
      • Optimal policy

    [Figure: 5×5 gridworld with states numbered 0 to 24, row by row]
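
The 5×5 grid of states numbered 0 to 24 on this slide can be read as a small gridworld MDP. The sketch below spells out the five MDP components for such a grid. The four movement actions, the deterministic transitions, the goal state 24, the reward of 1 for entering it, and the discount value are all assumptions made for illustration; the slide only shows the state numbering.

```python
N_ROWS, N_COLS = 5, 5

# Set of states: 0..24, numbered row by row as on the slide
STATES = list(range(N_ROWS * N_COLS))

# Set of actions (assumed): one step in each compass direction
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GOAL = 24    # hypothetical goal state (bottom-right corner)
GAMMA = 0.9  # hypothetical discount factor


def step(state, action):
    """Transition model and reward function (assumed deterministic).

    Moving into a wall leaves the agent where it is.
    Returns (next_state, reward).
    """
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = new_row * N_COLS + new_col
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward
```

For example, `step(0, "right")` moves from state 0 to state 1 with no reward, while `step(0, "up")` bumps into the wall and stays in state 0.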

  • PIGML Seminar - AirLab

    Reinforcement Learning: Q-learning

    ▪ Q-learning
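
The update rule shown on this slide was lost in extraction. A minimal sketch of the standard tabular Q-learning update, Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)], is given below. The function names, the dictionary-based table, the learning rate α, and the ε-greedy exploration helper are illustrative assumptions.

```python
import random
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


# Usage: a single transition (s=0, a="right", r=1, s'=1) on an empty table
Q_table = defaultdict(float)
first_td_error = q_learning_update(Q_table, 0, "right", 1.0, 1,
                                   ["left", "right"], alpha=0.5, gamma=0.9)
```

The update needs only sampled transitions (s, a, r, s'), which is what makes the method model-free: no transition model or reward function is consulted.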

  • PIGML Seminar - AirLab

    An Example of Reinforcement Learning

    http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html

  • PIGML Seminar - AirLab

    The need for Hierarchical RL

    ▪ Curse of dimensionality: applying Reinforcement Learning to problems with large state and/or action spaces is infeasible

    ▪ Abstraction: state and temporal abstractions simplify the problem

    ▪ Prior knowledge: complex tasks can often be decomposed into a hierarchy of sub-tasks

    ▪ Solution: sub-tasks can be effectively solved by Reinforcement Learning approaches

    ▪ Reuse: sub-tasks and abstract actions can be reused across different tasks in the same domain

  • PIGML Seminar - AirLab

    Hierarchical Reinforcement Learning

    ▪ The hierarchical approach to RL introduces temporal abstraction into the Reinforcement Learning framework

    ▪ Temporal abstraction goes by many names:
      • Macro-operators
      • Temporally extended actions
      • Options
      • Sub-tasks
      • Skills
      • Behaviors
      • Modes

  • PIGML Seminar - AirLab

    Hierarchical Reinforcement Learning

    ▪ From MDPs to SMDPs: with temporally extended actions, we must account for the time that elapses between decision instants

    ▪ Semi-Markov Decision Processes
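
To make the role of elapsed time concrete, here is a sketch of the commonly used SMDP variant of the Q-learning update for discrete time. If a temporally extended action runs for τ steps, the next state's value is discounted by γ^τ and the reward is the discounted sum collected along the way. The function names and the convention of passing the per-step rewards as a list are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of the rewards collected while the action was running."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def smdp_q_update(Q, s, a, rewards, s_next, actions, alpha=0.1, gamma=0.9):
    """SMDP Q-learning update for an action that lasted len(rewards) steps.

    Q is a plain dict keyed by (state, action); missing entries count as 0.
    Returns the updated value Q[(s, a)].
    """
    tau = len(rewards)  # time elapsed between the two decision instants
    r = discounted_return(rewards, gamma)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + (gamma ** tau) * best_next  # interval-dependent discount
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]


# Usage: a 3-step action from state 0 that ends in state 1
Q_smdp = {(1, "x"): 2.0}
updated = smdp_q_update(Q_smdp, 0, "a", [0.0, 0.0, 1.0], 1, ["x"],
                        alpha=1.0, gamma=0.5)
```

With τ = 1 and a single reward, the update reduces to ordinary one-step Q-learning.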

  • PIGML Seminar - AirLab

    Hierarchical RL Approaches

    ▪ Options Framework

    ▪ MAXQ Value Function Decomposition

    ▪ Hierarchies of Abstract Machines

  • PIGML Seminar - AirLab

    Options Framework

    ▪ An option o is defined as:
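
The definition itself was shown as an image and is missing here. In the options literature, an option is usually given as a triple: an initiation set I (the states where the option may be invoked), an internal policy π, and a termination condition β giving the probability of stopping in each state. The sketch below encodes that triple. The class and function names, the deterministic policy (the general definition allows a stochastic one), and the `step` callback are simplifying assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[int]       # I: states where the option can start
    policy: Callable[[int], str]         # pi: state -> primitive action (deterministic here)
    termination: Callable[[int], float]  # beta: state -> probability of terminating

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set


def run_option(option: Option, state: int,
               step: Callable[[int, str], Tuple[int, float]],
               rng=random, max_steps: int = 1000) -> Tuple[int, List[float]]:
    """Follow the option's policy until beta says stop (or max_steps is hit).

    `step(state, action)` is an assumed environment callback returning
    (next_state, reward). Returns the final state and the per-step rewards.
    """
    if not option.can_start(state):
        raise ValueError(f"option cannot be initiated in state {state}")
    rewards: List[float] = []
    for _ in range(max_steps):
        action = option.policy(state)
        state, reward = step(state, action)
        rewards.append(reward)
        # Terminate with probability beta(state)
        if rng.random() < option.termination(state):
            break
    return state, rewards
```

The list of rewards returned by `run_option` is exactly what an SMDP-style learner needs: its length is the duration of the option and its discounted sum is the option's reward.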

  • PIGML Seminar - AirLab

    Options Framework

    ▪ Between MDPs and SMDPs
      • MDP: discrete time, homogeneous discount
      • SMDP: continuous time, discrete events, interval-dependent discount
      • Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

    [Figure: state plotted against time for an MDP, an SMDP, and options over an MDP]

    Sutton (1999)

  • PIGML Seminar - AirLab

    Options Framework

     The introduction of options leads to a straightforward redefinition of all the elemen