Recent Advances in Hierarchical Reinforcement Learning

Post on 22-Mar-2018


  • PIGML Seminar - AirLab

    Recent Advances in Hierarchical Reinforcement Learning

    Authors: Andrew Barto, Sridhar Mahadevan

    Speaker: Alessandro Lazaric

    Outline

    Introduction to Reinforcement Learning
      Reinforcement Learning Inspirations and Foundations
      Markov Decision Processes (MDPs) and Q-learning

    Hierarchical Reinforcement Learning
      From MDPs to SMDPs
      Option Framework
      MAXQ Value Function Decomposition
      Other Approaches to Hierarchical Reinforcement Learning

    Future/Current/Past Research

    RL as Animal Psychology

    Of several responses [actions] made to the same situation, those which are followed by satisfaction to the animal will be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are followed by discomfort to the animal will have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

    RL as Neuroscience

    Much evidence suggests that dopamine cells play an important role in reinforcement and action learning

    Electrophysiological studies support a theory that dopamine cells signal a global prediction error for summed future reinforcement in appetitive conditioning tasks, in the form of a temporal difference (TD) prediction error term

    Reinforcement Signal R

    Kakade & Dayan (2002)
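
    For reference, the temporal difference (TD) prediction error mentioned above is commonly written in the following textbook form. The symbols \delta_t, V and \gamma are notation assumed here; they do not appear on the slide.

```latex
\[
  \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
\]
```

    Here V is the current estimate of summed future reinforcement, so \delta_t is positive when the outcome is better than predicted and negative when it is worse.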

    RL as Artificial Intelligence

    An artificial agent (either software or hardware) is placed in an environment

    The agent
      perceives the state of the environment
      acts on the environment through actions
      has a goal (planning)
      receives rewards from a critic

    States S, Actions A, Reward R(s,a)

    [Figure: agent-environment loop with the labels Environment, Agent, Critic, States, Actions, Reward]

    RL as Optimal Control

    A control system has sensors (i.e., states), actuators (i.e., actions) and costs (i.e., rewards)

    The environment is a stochastic dynamical system

    Often, the system can be formalized as a Markov Decision Process

    Optimal control

    RL as Discrete Time Differential Equations

    Value function

    Action value function

    Bellman equations

    Bellman (1957a)
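
    The formulas on this slide did not survive extraction. A standard way to write them, assuming a policy \pi, a transition model P, a reward function R(s,a) and a discount factor \gamma (textbook notation, not necessarily the slide's own), is:

```latex
\begin{align}
  V^{\pi}(s)   &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s\Big] \\
  Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s,\ a_0 = a\Big] \\
  V^{\pi}(s)   &= \sum_{a} \pi(a \mid s)\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s')\Big] \\
  Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a')
\end{align}
```

    The first two lines define the value function and the action value function; the last two are the corresponding Bellman equations, which relate the value of a state to the values of its successors.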

    RL as Operations Research

    Optimal functions

    Dynamic Programming (given P and R)

    Bellman (1957b)
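
    When P and R are known, the optimal value function can be computed by dynamic programming. The sketch below uses value iteration, one common dynamic programming method; the function name, the two-state toy MDP and all parameter values are assumptions made for illustration and are not taken from the slides.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """P[s][a][s2] is a transition probability, R[s][a] an expected reward."""
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states

    def q_value(s, a, values):
        # One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        return R[s][a] + gamma * sum(
            P[s][a][s2] * values[s2] for s2 in range(n_states)
        )

    for _ in range(max_iter):
        # Bellman optimality backup applied to every state
        V_new = [
            max(q_value(s, a, V) for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    # Greedy policy with respect to the final value estimate
    policy = [
        max(range(n_actions), key=lambda a: q_value(s, a, V))
        for s in range(n_states)
    ]
    return V, policy


# Hypothetical toy MDP with two states and two actions.
# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
# State 1: action 0 stays (reward 2), action 1 moves to state 0 (reward 0).
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [2.0, 0.0]]

V, policy = value_iteration(P, R)
print(V, policy)
```

    In this toy problem the best plan is to move to state 1 and stay there, so the greedy policy picks action 1 in state 0 and action 0 in state 1.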

    RL as a Milkshake

    Operations Research

    Bellman Equations

    Animal Psychology

    Optimal Control

    Neuroscience

    RL as a Machine Learning Paradigm!

    Reinforcement Learning is the most general Machine Learning paradigm

    RL is learning how to map states to actions so as to maximize a numerical reward in the long run

    RL is a multi-step decision-making process (often Markovian)

    An RL agent learns through a model-free trial-and-error process

    Actions may affect not only the immediate reward but also subsequent rewards (delayed effect)
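
    The "reward in the long run" and the delayed effect of actions can be made concrete with a discounted return. The snippet below is a minimal sketch; the discount value 0.9 and the two reward sequences are made-up numbers chosen only for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    g = 0.0
    # Accumulate from the last reward backwards: g = r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences for two different first actions
delayed = discounted_return([0.0, 0.0, 10.0])    # pays off only two steps later
immediate = discounted_return([1.0, 0.0, 0.0])   # small reward right away
print(delayed, immediate)
```

    An agent that looks only at the immediate reward would prefer the second sequence, while an agent that maximizes the discounted return prefers the first.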

    Reinforcement Learning Framework

    Markov Decision Process (MDP)
      Set of states S
      Set of actions A
      Transition model P
      Reward function R
      Discount factor γ

    Solution of an MDP
      Optimal (action) value function
      Optimal policy

    [Figure: 5x5 grid world with states numbered 0 to 24, row by row]

    0 1 2 3 4
    5 6 7 8 9
    10 11 12 13 14
    15 16 17 18 19
    20 21 22 23 24
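
    The elements of the MDP can be sketched in code for the 5x5 grid above. The state numbering follows the slide; the four actions, the deterministic moves, the goal state and the reward values are assumptions made here for illustration, since the slide does not specify them.

```python
N = 5  # 5x5 grid, states numbered 0..24 row by row as in the figure

# Set of actions (assumed): each maps to a (row, column) displacement
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GOAL = 24     # hypothetical goal state
GAMMA = 0.9   # hypothetical discount factor


def step(state, action):
    """Deterministic transition model and reward function in one call."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    # Moves that would leave the grid keep the agent at the border
    new_row = min(max(row + d_row, 0), N - 1)
    new_col = min(max(col + d_col, 0), N - 1)
    next_state = new_row * N + new_col
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


print(step(0, "right"))   # move from state 0 to state 1
print(step(19, "down"))   # move from state 19 into the goal
```

    A stochastic transition model would return a distribution over next states instead of a single state; the deterministic version is used here only to keep the sketch short.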

    Reinforcement Learning: Q-learning

    Q-learning
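
    The update rule on this slide was lost in extraction. The sketch below shows standard tabular Q-learning, Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)], on a hypothetical five-state corridor. The environment, the epsilon-greedy exploration and all parameter values are assumptions, not the slide's own example.

```python
import random

N_STATES = 5          # corridor with states 0..4; state 4 is the goal (toy task)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Move one cell; reward 1 on reaching the goal, 0 otherwise."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1,
               max_steps=1000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Epsilon-greedy action selection with random tie-breaking
            if rng.random() < epsilon or Q[state][LEFT] == Q[state][RIGHT]:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = LEFT if Q[state][LEFT] > Q[state][RIGHT] else RIGHT
            next_state, reward, done = step(state, action)
            # The target bootstraps from the best action in the next state
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

    The agent never uses the transition model or the reward function directly; it only observes sampled transitions, which is what makes the method model-free.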

    An Example of Reinforcement Learning

    http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html

    The need for Hierarchical RL

    Curse of dimensionality: applying Reinforcement Learning to problems with large action and/or state spaces is infeasible

    Abstraction: state and temporal abstractions make it possible to simplify the problem

    Prior knowledge: complex tasks can often be decomposed into a hierarchy of sub-tasks

    Solution: sub-tasks can be effectively solved by Reinforcement Learning approaches

    Reuse: sub-tasks and abstract actions can be used in different tasks in the same domain

    Hierarchical Reinforcement Learning

    The hierarchical approach to RL introduces temporal abstraction into the Reinforcement Learning framework

    Temporal abstraction is also known as
      Macro-operators
      Temporally extended actions
      Options
      Sub-tasks
      Skills
      Behaviors
      Modes

    Hierarchical Reinforcement Learning

    From MDPs to SMDPs: with temporally extended actions we need to take into account the amount of time that passes between decision instants

    Semi-Markov Decision Processes
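
    The SMDP formulas on this slide were lost in extraction. A common discrete-time form of the SMDP optimality equation is given below, where \tau is the random number of time steps an action takes. The notation is assumed here, in the usual textbook style, rather than copied from the slide.

```latex
\[
  Q^{*}(s,a) = R(s,a) + \sum_{s',\,\tau} \gamma^{\tau}\, P(s', \tau \mid s, a)\, \max_{a'} Q^{*}(s', a')
\]
```

    Here R(s,a) is the expected discounted reward accumulated while a is executing, and the factor \gamma^{\tau} discounts the future according to how long the action lasted.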

    Hierarchical RL Approaches

    Options Framework

    MAXQ Value Function Decomposition

    Hierarchies of Abstract Machines

    Options Framework

    An option o is defined as a triple o = ⟨I, π, β⟩:
      Initiation set I ⊆ S: the states in which o can be started
      Policy π: the behavior followed while o is executing
      Termination condition β: S → [0, 1]: the probability that o terminates in each state
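
    In the options framework of Sutton, Precup and Singh, an option is usually specified by an initiation set, an internal policy and a termination condition. The sketch below is one way to represent and execute such an option. The class and function names, the corridor environment and the discount value are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[int]        # states where the option may start
    policy: Callable[[int], int]          # state -> primitive action
    termination: Callable[[int], float]   # state -> probability of stopping


def run_option(option, state, step, rng, gamma=0.9):
    """Run the option to termination.

    Returns (final state, discounted cumulative reward, duration in steps).
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total, discount, duration = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        duration += 1
        if rng.random() < option.termination(state):
            return state, total, duration


def step(state, action):
    """Hypothetical corridor 0..4 with reward 1 on entering state 4."""
    next_state = min(max(state + action, 0), 4)
    return next_state, (1.0 if next_state == 4 else 0.0)


# Option "walk right until the end of the corridor"
go_right = Option(
    initiation_set=frozenset({0, 1, 2, 3}),
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 4 else 0.0,
)

final_state, reward, duration = run_option(go_right, 0, step, random.Random(0))
print(final_state, reward, duration)
```

    From the point of view of a higher-level policy, the whole walk is a single decision that lasts four time steps, which is why the discounted reward and the duration are both returned.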

    Options Framework

    Between MDPs and SMDPs

    MDP: discrete time, homogeneous discount

    SMDP: continuous time, discrete events, interval-dependent discount

    Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

    [Figure: state trajectories over time for an MDP, an SMDP, and options over an MDP]

    Sutton (1999)

    Options Framework

    The introduction of options leads to a straightforward redefinition of all the elements

    Option reward:

    Option transition model:

    (Hierarchical) Policy over options:
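
    The three definitions on this slide were lost in extraction. In the usual formulation of the options framework they read as follows, for an option o initiated in state s at time t and terminating after k steps. The symbols p, \mu and \mathcal{O} are notation assumed here.

```latex
\begin{align}
  R(s,o) &= \mathbb{E}\big[\, r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} \;\big|\; o \text{ initiated in } s \text{ at time } t \,\big] \\
  P(s' \mid s,o) &= \sum_{k=1}^{\infty} \gamma^{k}\, p(s', k \mid s, o) \\
  \mu &: S \times \mathcal{O} \to [0,1]
\end{align}
```

    Here p(s', k | s, o) is the probability that o terminates in s' after exactly k steps, so the discount is folded into the option transition model, and \mu(s,o) is the probability of selecting option o in state s.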

    Options Framework

    Value Function

    Action Value Function

    SMDP Q-learning
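
    SMDP Q-learning treats each option as a single decision and updates its value only when the option terminates. A minimal sketch of the usual update, Q(s,o) <- Q(s,o) + alpha [r + gamma^k max_o' Q(s',o') - Q(s,o)], follows. Here r is the discounted reward accumulated over the k steps the option ran; the function name, state names and option names are hypothetical.

```python
from collections import defaultdict


def smdp_q_update(Q, state, option, reward, duration, next_state, options,
                  alpha=0.1, gamma=0.9):
    """Apply one SMDP Q-learning update after `option` has terminated.

    reward   -- discounted reward accumulated while the option was running
    duration -- number of primitive time steps the option lasted (k)
    """
    best_next = max(Q[(next_state, o)] for o in options)
    # The future is discounted by gamma**k because k time steps have elapsed
    target = reward + gamma ** duration * best_next
    Q[(state, option)] += alpha * (target - Q[(state, option)])
    return Q[(state, option)]


# Hypothetical example: an option that takes 3 steps to get from "room" to "hall"
Q = defaultdict(float)
Q[("hall", "go_to_goal")] = 1.0
options = ["go_to_hall", "go_to_goal"]

updated = smdp_q_update(Q, "room", "go_to_hall", reward=0.0, duration=3,
                        next_state="hall", options=options, alpha=0.5)
print(updated)
```

    With a duration of 1 the update reduces to ordinary one-step Q-learning, which is a quick way to check an implementation.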

    Options Framework

    Option optimizations
      Intra-option learning: after each primitive action, update all the options that could have taken that action

    [Figure: Option 1, Option 2, intra-option update]
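
    Intra-option learning can be sketched as follows for options with deterministic internal policies. After a primitive transition (s, a, r, s'), every option whose policy would have chosen a in s is updated, whether or not it was the option actually running. The update form follows the usual intra-option Q-learning rule; the function name, the data layout and the example options are assumptions made for illustration.

```python
from collections import defaultdict


def intra_option_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """options maps a name to (policy, beta): policy(s) -> action, beta(s) -> prob."""
    best_next = max(Q[(s_next, name)] for name in options)
    updated = []
    for name, (policy, beta) in options.items():
        if policy(s) != a:
            continue  # this option would not have taken action a in state s
        b = beta(s_next)
        # Value of arriving in s_next: continue the option with probability
        # 1 - b, or terminate with probability b and pick the best option
        arrival_value = (1.0 - b) * Q[(s_next, name)] + b * best_next
        Q[(s, name)] += alpha * (r + gamma * arrival_value - Q[(s, name)])
        updated.append(name)
    return updated


# Hypothetical options on a corridor with actions +1 (right) and -1 (left)
options = {
    "right_to_4": (lambda s: +1, lambda s: 1.0 if s == 4 else 0.0),
    "right_to_2": (lambda s: +1, lambda s: 1.0 if s == 2 else 0.0),
    "left_to_0": (lambda s: -1, lambda s: 1.0 if s == 0 else 0.0),
}

Q = defaultdict(float)
Q[(2, "right_to_4")] = 0.9

# One primitive step to the right from state 1 to state 2, with reward 0
updated = intra_option_update(Q, options, s=1, a=+1, r=0.0, s_next=2, alpha=0.5)
print(updated)
```

    A single primitive step updates both right-moving options here, which is how intra-option learning extracts more from each experience than waiting for an option to terminate.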

    Options Framework

    Option optimizations
      Termination improvement: interrupt the execution of an option o whenever there is another option o' whose expected reward is greater

    [Figure: landmark navigation task from start S to goal G, showing the landmarks and the range (input set) of each run-to-landmark controller. SMDP solution: 600 steps. Termination-improved solution: 474 steps.]
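
    The interruption test behind termination improvement can be sketched as a comparison of option values in the current state. The function name, the state names and the values in the table below are hypothetical; only the rule itself, interrupting when another option looks better, comes from the slide.

```python
def should_interrupt(Q, state, current_option, options):
    """Return True if some other option has a strictly higher value in `state`."""
    current_value = Q.get((state, current_option), 0.0)
    best_other = max(
        (Q.get((state, o), 0.0) for o in options if o != current_option),
        default=float("-inf"),
    )
    return best_other > current_value


# Hypothetical option values for two run-to-landmark controllers
Q = {
    ("near_goal", "run_to_landmark_A"): 0.5,
    ("near_goal", "run_to_landmark_B"): 0.8,
    ("far", "run_to_landmark_A"): 0.6,
    ("far", "run_to_landmark_B"): 0.2,
}
options = ["run_to_landmark_A", "run_to_landmark_B"]

interrupt_near = should_interrupt(Q, "near_goal", "run_to_landmark_A", options)
interrupt_far = should_interrupt(Q, "far", "run_to_landmark_A", options)
print(interrupt_near, interrupt_far)
```

    Checking this condition at every step lets the agent cut an option short instead of always running to its landmark, which is the kind of shortcut reflected in the step counts reported for the figure.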

    Options Framework

    Pros
      Options are very simple to implement
      Options are effective in defining high-level skills
      Options improve the speed of convergence
      Options can be used to define hierarchies of o
