
  • Moscow 2016

    Reinforcement Learning:

    Beyond Markov Decision Processes

    Alexey O. Seleznev

    PhD in Computational Chemistry

    5vision team

    Deep Learning Moscow

    Seminar # 9

  • OUTLINE

    Introduction

    Markov Decision Processes and their Limitations

    Main Point of the Presentation

    Partially Observable Markov Decision Processes

    Bayesian Reinforcement Learning

    Multi-agent Systems

    References

    2

  • Introduction

    Reinforcement Learning (RL):

    • Agent interacts with a dynamic, stochastic, and incompletely known

    environment with the goal of finding a strategy (policy) that optimizes some

    long-term performance measure

    • Unlike supervised machine learning (ML), RL focuses on strategies, not on

    forecasts

    • Examples of tasks:

    3

  • Markov Decision Processes and their Limitations

    To solve RL tasks, we have to formalize the approach

    It turns out that the most convenient way to do this is to use a Markov Decision

    Process (MDP), which consists of:

    • A set of available states $S = \{s_1, s_2, \dots, s_{|S|}\}$

    • A set of available actions $A = \{a_1, a_2, \dots, a_{|A|}\}$

    • A reward function, $R: S \times A \to \mathbb{R}$

    • A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$

    • A discount factor, $\gamma \in [0, 1]$

    4

  • 5

    Markov Decision Processes and their Limitations

    Grid world is a good example of a task that can be formulated within the MDP

    framework:

    states: cell numbers

    actions: up, down, left, right

    transition matrix: 0 if the destination cell is a wall, 1 otherwise

    reward: as shown on the scheme

    The agent's goal is to find a policy $\pi(s)$ that yields the highest cumulative reward within a fixed number of steps

    How to find such a policy?
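    One answer, previewed before the taxonomy on the next slide, is dynamic programming. Below is a minimal finite-horizon value-iteration sketch for a tiny deterministic grid-world-style MDP; the concrete 2x2 transition and reward tables are illustrative assumptions, not the ones from the slide's figure.

```python
# Finite-horizon value iteration for a tiny deterministic MDP.
# States are cell indices; transitions[s][a] is the next cell (or s itself
# if the move hits a wall); rewards[s2] is collected on entering cell s2.
transitions = {0: {'right': 1, 'down': 2, 'left': 0, 'up': 0},
               1: {'right': 1, 'down': 3, 'left': 0, 'up': 1},
               2: {'right': 3, 'down': 2, 'left': 2, 'up': 0},
               3: {'right': 3, 'down': 3, 'left': 2, 'up': 1}}
rewards = {0: 0.0, 1: -1.0, 2: 0.0, 3: 10.0}      # illustrative values
gamma, horizon = 0.9, 10

V = {s: 0.0 for s in transitions}                  # V_0(s) = 0
for _ in range(horizon):
    V_new = {}
    for s, moves in transitions.items():
        # Bellman backup: best one-step lookahead over the four actions
        V_new[s] = max(rewards[s2] + gamma * V[s2] for s2 in moves.values())
    V = V_new

# Greedy policy with respect to the final value function
policy = {s: max(moves, key=lambda a: rewards[moves[a]] + gamma * V[moves[a]])
          for s, moves in transitions.items()}
print(V)
print(policy)
```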

  • 6

    Markov Decision Processes and their Limitations

    Classical MDP methods:

    Model-based:

    • Conventional model-based (Dynamic Programming)

    • Bayesian RL

    • PAC-MDP (E3, Rmax)

    Model-free:

    • Value-based (pure critic): Monte-Carlo, Temporal-Difference (SARSA, Q-Learning)

    • Policy-based (pure actor): REINFORCE, finite-difference methods

    • Actor-critic
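    As an illustration of the model-free, value-based branch, here is a minimal tabular Q-learning sketch. The environment interface (`reset()`, `step()` returning `(next_state, reward, done)`, and an `n_actions` attribute) is an assumption for illustration, not something from the talk.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                     # Q[(state, action)] -> value
    actions = list(range(env.n_actions))       # assumed environment attribute

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # TD(0) target uses the greedy value of the next state
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```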

  • 7

    Markov Decision Processes and their Limitations

    Limitations of the classical MDP framework:

    The concept of state is the most unrealistic and stylized aspect of the MDP framework

    • What is a state? All information relevant to predicting subsequent dynamics and rewards; this does not require a one-to-one mapping between states and observations

    • The Markov property requires that states be organized in such a way that history (previous states and actions) is not relevant for predicting subsequent dynamics and rewards

    • So, one limitation of MDP arises when the Markov property is violated. Possible solution: augment states to "full states" by including (i) relevant information from other states and/or (ii) a record of previous actions. Example: the last 4 game screens in DQN (see the sketch after this list)

    • Another limitation is that in some cases even the full history is not enough to determine the underlying state. Examples: a frog in mist, a financial market
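    A minimal sketch of the state-augmentation idea above: approximate a "full state" by a sliding window over the last k observations, as in DQN's 4-screen input. The class name and the choice k=4 here are illustrative assumptions.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Approximate a 'full state' by stacking the last k observations,
    in the spirit of DQN's 4-screen input."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the window with copies of the first observation
        for _ in range(self.k):
            self.frames.append(np.asarray(first_obs))
        return self.state()

    def push(self, obs):
        self.frames.append(np.asarray(obs))
        return self.state()

    def state(self):
        # Shape (k, ...): the augmented, approximately Markov state
        return np.stack(self.frames, axis=0)
```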

  • 8

    Markov Decision Processes and their Limitations

    The majority of MDP methods face the exploration vs. exploitation dilemma

    • The data used for learning in RL depend on the agent's own behaviour

    • Two goals: (i) exploration: to learn as much as possible, (ii) exploitation: to obtain as much reward as possible

    • What combination of the two objectives will result in the greatest long-term reward?

    • Existing methods use a variety of techniques to mitigate the dilemma,

    among them:

    Epsilon-greedy strategy

    Boltzmann sampling

    Optimism in the face of uncertainty

    Intrinsic motivation
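    For concreteness, minimal sketches of the first two techniques in the list; these are standard formulations written for illustration, not code from the talk.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon pick a random action (explore),
    otherwise the currently best action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample actions with probability proportional to exp(Q / temperature);
    high temperature -> near-uniform exploration, low -> near-greedy."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```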

  • 9

    Markov Decision Processes and their Limitations

    How to operate in multi-agent environments?

  • 10

    Main Point of the Presentation

    Solutions to the difficulties mentioned can be

    formulated within the MDP framework, but with

    specific choices of its components.

  • 11

    Partially Observable Markov Decision Processes

    To get an idea of what it is, let us consider a tiger example:

    N. Daw (2013)

  • 12

    Partially Observable Markov Decision Processes

    A Partially Observable Markov Decision Process (POMDP) consists of:

    • A set of available states $S = \{s_1, s_2, \dots, s_{|S|}\}$

    • A set of available actions $A = \{a_1, a_2, \dots, a_{|A|}\}$

    • A reward function, $R: S \times A \to \mathbb{R}$

    • A set of observations: $\Omega = \{o_1, o_2, \dots, o_{|\Omega|}\}$

    • A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$

    • Conditional observation probabilities: $Z^a_{ij} = P(O_{t+1} = j \mid S_{t+1} = i, a_t = a)$

    • A discount factor, $\gamma \in [0, 1]$

  • 13

    Partially Observable Markov Decision Processes

    How to solve a POMDP?

    Model-based approach

    1. Reformulate it as an MDP, using, e.g.:

    • Belief state MDPs

    • Cross-product MDPs

    2. Solve the resulting MDP by means of, e.g.:

    • Policy Iteration

    • Value Iteration

    • Gradient methods

    Model-free approach

    • Incorporating memory (HMM, RNN, Finite State Controllers)

    • Policy-gradient methods

    D. Braziunas (2003)

  • 14

    Partially Observable Markov Decision Processes

    There exists a direct connection between a POMDP and an MDP over belief states, characterized by a quadruple $\langle B, A, \tau, \rho \rangle$: the set of belief states $B$, the actions $A$, a belief transition function $\tau(b, a, b')$, and a belief reward $\rho(b, a) = \sum_s b(s) R(s, a)$.

    D. Braziunas (2003)

    $b$ is a belief state: a probability distribution over states.
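    The belief transition is driven by the standard Bayesian belief update (consistent with the POMDP components above, though not written out on the slide): after taking action $a$ in belief $b$ and observing $o$,

    $$ b'(s') = \frac{Z(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)}, \qquad P(o \mid b, a) = \sum_{s'} Z(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s). $$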

  • 15

    Partially Observable Markov Decision Processes

    Evolution of belief state (example):
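    The example figure is not reproduced in this transcript; as a stand-in, here is a small sketch of how a belief over the two states of the tiger example (slide 11) evolves under repeated noisy "listen" observations. The 0.85 listening accuracy is the value commonly used in the classic tiger problem and is an assumption here, not a number from the slide.

```python
def update_belief(b_left, obs, acc=0.85):
    """Bayes update of P(tiger-left) after a 'listen' action.
    obs is 'hear-left' or 'hear-right'; acc is the assumed accuracy."""
    p_obs_left = acc if obs == 'hear-left' else 1.0 - acc
    p_obs_right = 1.0 - acc if obs == 'hear-left' else acc
    num = p_obs_left * b_left
    return num / (num + p_obs_right * (1.0 - b_left))

b = 0.5                                   # start maximally uncertain
for obs in ['hear-left', 'hear-left', 'hear-right', 'hear-left']:
    b = update_belief(b, obs)
    print(f"after {obs}: P(tiger-left) = {b:.3f}")
```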

  • 16

    Partially Observable Markov Decision Processes

    D. Braziunas (2003)

    How to find an optimal policy for a POMDP?

    Policy trees (for the finite-horizon case):

    The optimal $t$-step value function can be found simply by enumerating all the possible policy trees in the set $\Gamma_t$
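    For reference (a standard result in the POMDP literature, not written out on the slide), the value of a $t$-step policy tree $p$ with root action $a(p)$ and subtree $p_o$ for each observation $o$ is

    $$ V_p(s) = R(s, a(p)) + \gamma \sum_{s'} T(s' \mid s, a(p)) \sum_{o \in \Omega} Z(o \mid s', a(p))\, V_{p_o}(s'), \qquad V_p(b) = \sum_{s} b(s)\, V_p(s), $$

    and the number of distinct $t$-step trees grows as $|A|^{(|\Omega|^t - 1)/(|\Omega| - 1)}$, which is why plain enumeration becomes infeasible very quickly.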

  • 17

    Partially Observable Markov Decision Processes

    The optimal $t$-step POMDP value function is piecewise linear and convex in $b$

    In practice, exact solution is computationally intractable; a set of simplifications has therefore been suggested.
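    In standard notation (not spelled out on the slide), each policy tree $p \in \Gamma_t$ induces an $|S|$-dimensional "alpha-vector" $\alpha_p(s) = V_p(s)$, and the piecewise-linear-and-convex property is just

    $$ V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s), $$

    i.e. the upper surface of finitely many hyperplanes over the belief simplex.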

  • 18

    Bayesian Reinforcement Learning

    In Bayesian RL, we encode the unknown transition function $T(s_{t+1} \mid s_t, a_t)$ with random variables $\theta^a_{ij}$ distributed according to a multinomial distribution.

    The agent maintains a posterior belief b over all possible transition models {T} given

    its previous experience and a prior (Dirichlet distribution).

    The task can be reformulated as either a POMDP or an MDP by redefining the state as consisting of an observable part $S$ and the unobservable parameters of the Dirichlet distribution. The construction is called a superstate: $\tilde{S} = S \times \theta$

    Due to the complexity of the belief state, Bayesian RL is typically intractable in

    terms of both planning and updating the belief after an action. A recent approximate

    solution to Bayesian RL is the Bayesian exploration bonus. Lopes et al. (2012)

    Ross et al. (2011)
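    A minimal sketch of the Dirichlet bookkeeping behind this belief: by conjugacy, the posterior over each transition distribution stays Dirichlet, so updating the belief after an observed transition amounts to incrementing counts. The class and method names here are illustrative, not from the talk.

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Posterior over T(.|s,a) represented as Dirichlet(alpha0 + counts)."""
    def __init__(self, n_states, alpha0=1.0):
        self.n_states = n_states
        self.alpha0 = alpha0                                  # symmetric prior
        self.counts = defaultdict(lambda: [0.0] * n_states)   # keyed by (s, a)

    def update(self, s, a, s_next):
        # Conjugacy: observing s --a--> s_next just adds one to that count
        self.counts[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean estimate of T(s'|s,a)
        c = self.counts[(s, a)]
        total = sum(c) + self.alpha0 * self.n_states
        return [(ci + self.alpha0) / total for ci in c]
```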

  • 19

    Multi-agent Systems

    Stochastic games extend MDPs to multiple agents. The main difference between a standard MDP and its multi-player extension is that each agent independently chooses actions and receives rewards, while the state transition matrix is defined over the full joint action.

    Mac Dermed et al. (2011)
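    For concreteness, the standard definition implied here (not spelled out on the slide): an $n$-player stochastic game is a tuple

    $$ \langle S, A_1, \dots, A_n, T, R_1, \dots, R_n \rangle, \qquad T(s' \mid s, a_1, \dots, a_n), \qquad R_i : S \times A_1 \times \dots \times A_n \to \mathbb{R}, $$

    so both the transitions and each player's reward depend on the joint action; with $n = 1$ this reduces to an ordinary MDP.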

  • 20

    Multi-agent Systems

    How to solve stochastic games?

    Replace V(s) in Bellman’s equation with an achievable set function

    As a group of $n$ agents follows a joint policy, each player receives rewards. The discounted sum of these rewards is that player's utility. The joint utility is the vector of the players' utilities.

    An achievable set contains all possible joint utilities that players can receive using policies in equilibrium.

    Mac Dermed et al. (2011)
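    In symbols (a standard formulation, not written out on the slide): under a joint policy $\pi$, player $i$'s utility and the joint utility are

    $$ u_i(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_{i,t}\right], \qquad \vec{u}(\pi) = (u_1(\pi), \dots, u_n(\pi)), $$

    and the achievable set at a state collects the joint-utility vectors $\vec{u}(\pi)$ attainable by equilibrium joint policies; this set-valued object is what replaces the scalar $V(s)$ in Bellman's equation.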

  • 21

    References

    Nathaniel Daw. Advanced Reinforcement Learning. In Neuroeconomics (Chapter 16). Elsevier Inc. (2013)

    Darius Braziunas. POMDP Solution Methods. Tutorial, University of Toronto (2003)

    Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress. In NIPS Proceedings (2012)

    Stephane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann. A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes. Journal of Machine Learning Research 12 (2011) 1729-1770

    Liam Mac Dermed, Charles L. Isbell, Lora Weiss. Markov Games of Incomplete Information for Multi-Agent Reinforcement Learning. Interactive Decision Theory and Game Theory: Papers from the 2011 AAAI Workshop

  • THANK YOU FOR YOUR ATTENTION!

    22