
  • Moscow 2016

    Reinforcement Learning:

    Beyond Markov Decision Processes

    Alexey O. Seleznev

    PhD in Computational Chemistry

    5vision team

    Deep Learning Moscow

    Seminar # 9

  • OUTLINE

    Introduction

    Markov Decision Processes and their Limitations

    Main Point of the Presentation

    Partially Observable Markov Decision Processes

    Bayesian Reinforcement Learning

    Multi-agent Systems

    References

    2

  • Introduction

    Reinforcement Learning (RL):

    • Agent interacts with a dynamic, stochastic, and incompletely known

    environment with the goal of finding a strategy (policy) that optimizes some

    long-term performance measure

    • Unlike supervised machine learning (ML), RL focuses on strategies, not on

    forecasts

    • Examples of tasks:

    3

  • Markov Decision Processes and their Limitations

    To solve RL tasks, we have to formalize the approach

    It turns out that the most convenient way to do this is to use a Markov Decision

    Process (MDP), which consists of:

    • A set of available states $S = \{s_1, s_2, \dots, s_{|S|}\}$

    • A set of available actions $A = \{a_1, a_2, \dots, a_{|A|}\}$

    • A reward function, $R: S \times A \to \mathbb{R}$

    • A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$

    • A discount factor, $\gamma \in [0, 1]$

    4

  • 5

    Markov Decision Processes and their Limitations

    Grid world is a good example of a task that can be formulated within the MDP

    framework:

    states: cell numbers

    actions: up, down, left, right

    transition matrix: 0 if the destination cell is a wall, 1 otherwise

    reward: as shown on the scheme

    The agent's goal is to find a policy $\pi(s)$ that yields the highest cumulative reward within a fixed number of steps

    How to find such a policy?
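    One answer, previewed before the taxonomy on the next slide, is dynamic programming. Below is a minimal finite-horizon value-iteration sketch for a tiny deterministic grid-world-style MDP; the concrete 2x2 transition and reward tables are illustrative assumptions, not the ones from the slide's figure.

```python
# Finite-horizon value iteration for a tiny deterministic MDP.
# States are cell indices; transitions[s][a] is the next cell (or s itself
# if the move hits a wall); rewards[s2] is collected on entering cell s2.
transitions = {0: {'right': 1, 'down': 2, 'left': 0, 'up': 0},
               1: {'right': 1, 'down': 3, 'left': 0, 'up': 1},
               2: {'right': 3, 'down': 2, 'left': 2, 'up': 0},
               3: {'right': 3, 'down': 3, 'left': 2, 'up': 1}}
rewards = {0: 0.0, 1: -1.0, 2: 0.0, 3: 10.0}      # illustrative values
gamma, horizon = 0.9, 10

V = {s: 0.0 for s in transitions}                  # V_0(s) = 0
for _ in range(horizon):
    V_new = {}
    for s, moves in transitions.items():
        # Bellman backup: best one-step lookahead over the four actions
        V_new[s] = max(rewards[s2] + gamma * V[s2] for s2 in moves.values())
    V = V_new

# Greedy policy with respect to the final value function
policy = {s: max(moves, key=lambda a: rewards[moves[a]] + gamma * V[moves[a]])
          for s, moves in transitions.items()}
print(V)
print(policy)
```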

  • 6

    Markov Decision Processes and their Limitations

    Classical MDP methods:

    Model-based:

    • Conventional model-based (Dynamic Programming)

    • Bayesian RL

    • PAC-MDP (E3, Rmax)

    Model-free:

    • Value-based (pure critic): Monte-Carlo, Temporal-Difference (SARSA, Q-Learning)

    • Policy-based (pure actor): REINFORCE, finite-difference methods

    • Actor-critic
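    As an illustration of the model-free, value-based branch, here is a minimal tabular Q-learning sketch. The environment interface (`reset()`, `step()` returning `(next_state, reward, done)`, and an `n_actions` attribute) is an assumption for illustration, not something from the talk.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                     # Q[(state, action)] -> value
    actions = list(range(env.n_actions))       # assumed environment attribute

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # TD(0) target uses the greedy value of the next state
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```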

  • 7

    Markov Decision Processes and their Limitations

    Limitations of the classical MDP framework:

    The concept of state is the most unrealistic and stylized aspect of the MDP framework

    • What is a state? All information relevant to predicting subsequent dynamics and rewards; this does not require a one-to-one mapping between states and observations

    • The Markov property requires that states be organized in such a way that history (previous states and actions) is not relevant for predicting subsequent dynamics and rewards

    • So, one limitation of MDP arises when the Markov property is violated. Possible solution: augment states to "full states" by including (i) relevant information from other states and/or (ii) a record of previous actions. Example: the last 4 game screens in DQN (see the sketch after this list)

    • Another limitation is that in some cases even the full history is not enough to determine the underlying state. Examples: a frog in mist, a financial market
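    A minimal sketch of the state-augmentation idea above: approximate a "full state" by a sliding window over the last k observations, as in DQN's 4-screen input. The class name and the choice k=4 here are illustrative assumptions.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Approximate a 'full state' by stacking the last k observations,
    in the spirit of DQN's 4-screen input."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the window with copies of the first observation
        for _ in range(self.k):
            self.frames.append(np.asarray(first_obs))
        return self.state()

    def push(self, obs):
        self.frames.append(np.asarray(obs))
        return self.state()

    def state(self):
        # Shape (k, ...): the augmented, approximately Markov state
        return np.stack(self.frames, axis=0)
```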

  • 8

    Markov Decision Processes and their Limitations

    The majority of MDP methods face the exploration vs. exploitation dilemma

    • The data used for learning in RL depend on the agent's own behaviour

    • Two goals: (i) exploration: to learn as much as possible, (ii) exploitation: to obtain as much reward as possible

    • What combination of the two objectives will result in the greatest long-term reward?

    • Existing methods use a variety of techniques to mitigate the dilemma,

    among them:

    Epsilon-greedy strategy

    Boltzmann sampling

    Optimism in the face of uncertainty

    Intrinsic motivation
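    For concreteness, minimal sketches of the first two techniques in the list; these are standard formulations written for illustration, not code from the talk.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon pick a random action (explore),
    otherwise the currently best action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample actions with probability proportional to exp(Q / temperature);
    high temperature -> near-uniform exploration, low -> near-greedy."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```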

  • 9

    Markov Decision Processes and their Limitations

    How to operate in multi-agent environments?

  • 10

    Main Point of the Presentation

    Solutions to the difficulties mentioned can be

    formulated within the MDP framework, but with

    specific choices of its components.

  • 11

    Partially Observable Markov Decision Processes

    To get an idea of what it is, let us consider a tiger example:

    N. Daw (2013)

  • 12

    Partially Observable Markov Decision Processes

    A Partially Observable Markov Decision Process (POMDP) consists of:

    • A set of available states $S = \{s_1, s_2, \dots, s_{|S|}\}$

    • A set of available actions $A = \{a_1, a_2, \dots, a_{|A|}\}$

    • A reward function, $R: S \times A \to \mathbb{R}$

    • A set of observations: $\Omega = \{o_1, o_2, \dots, o_{|\Omega|}\}$

    • A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$

    • Conditional observation probabilities: $Z^a_{ij} = P(O_{t+1} = j \mid S_{t+1} = i, a_t = a)$

    • A discount factor, $\gamma \in [0, 1]$

  • 13

    Partially Observable Markov Decision Processes

    How to solve a POMDP?

    Model-based approach

    1. Reformulate it as an MDP, using, e.g.:

    • Belief state MDPs

    • Cross-product MDPs

    2. Solve the resulting MDP by means of, e.g.:

    • Policy Iteration

    • Value Iteration

    • Gradient methods

    Model-free approach

    • Incorporating memory (HMM, RNN, Finite State Controllers)

    • Policy-gradient methods

    D. Braziunas (2003)

  • 14

    Partially Observable Markov Decision Processes

    There exists a direct connection between a POMDP and an MDP over belief states, characterized by a quadruple $\langle B, A, \tau, \rho \rangle$: the set of belief states $B$, the actions $A$, a belief transition function $\tau(b, a, b')$, and a belief reward $\rho(b, a) = \sum_s b(s) R(s, a)$.

    D. Braziunas (2003)

    $b$ is a belief state: a probability distribution over states.
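    The belief transition is driven by the standard Bayesian belief update (consistent with the POMDP components above, though not written out on the slide): after taking action $a$ in belief $b$ and observing $o$,

    $$ b'(s') = \frac{Z(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)}, \qquad P(o \mid b, a) = \sum_{s'} Z(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s). $$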

  • 15

    Partially Observable Markov Decision Processes

    Evolution of belief state (example):
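    The example figure is not reproduced in this transcript; as a stand-in, here is a small sketch of how a belief over the two states of the tiger example (slide 11) evolves under repeated noisy "listen" observations. The 0.85 listening accuracy is the value commonly used in the classic tiger problem and is an assumption here, not a number from the slide.

```python
def update_belief(b_left, obs, acc=0.85):
    """Bayes update of P(tiger-left) after a 'listen' action.
    obs is 'hear-left' or 'hear-right'; acc is the assumed accuracy."""
    p_obs_left = acc if obs == 'hear-left' else 1.0 - acc
    p_obs_right = 1.0 - acc if obs == 'hear-left' else acc
    num = p_obs_left * b_left
    return num / (num + p_obs_right * (1.0 - b_left))

b = 0.5                                   # start maximally uncertain
for obs in ['hear-left', 'hear-left', 'hear-right', 'hear-left']:
    b = update_belief(b, obs)
    print(f"after {obs}: P(tiger-left) = {b:.3f}")
```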

  • 16

    Partially Observable Markov Decision Processes

    D. Braziunas (2003)

    How to find an optimal policy for a POMDP?

    Policy trees (for the finite-horizon case):

    The optimal $t$-step value function can be found simply by enumerating all the possible policy trees in the set $\Gamma_t$
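    For reference (a standard result in the POMDP literature, not written out on the slide), the value of a $t$-step policy tree $p$ with root action $a(p)$ and subtree $p_o$ for each observation $o$ is

    $$ V_p(s) = R(s, a(p)) + \gamma \sum_{s'} T(s' \mid s, a(p)) \sum_{o \in \Omega} Z(o \mid s', a(p))\, V_{p_o}(s'), \qquad V_p(b) = \sum_{s} b(s)\, V_p(s), $$

    and the number of distinct $t$-step trees grows as $|A|^{(|\Omega|^t - 1)/(|\Omega| - 1)}$, which is why plain enumeration becomes infeasible very quickly.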

  • 17

    Partially Observable Markov Decision Processes

    The optimal $t$-step POMDP value function is piecewise linear and convex in $b$

    In practice, exact solution is computationally intractable; a set of simplifications has therefore been suggested.
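    In standard notation (not spelled out on the slide), each policy tree $p \in \Gamma_t$ induces an $|S|$-dimensional "alpha-vector" $\alpha_p(s) = V_p(s)$, and the piecewise-linear-and-convex property is just

    $$ V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s), $$

    i.e. the upper surface of finitely many hyperplanes over the belief simplex.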

  • 18

    Bayesian Reinforcement Learning

    In Bayesian RL, we encode the unknown transition function $T(s_{t+1} \mid s_t, a_t)$ with random variables $\theta^a_{ij}$ distributed according to a multinomial distribution.

    The agent maintains a posterior belief b over all possible transition models {T} given

    its previous experience and a prior (Dirichlet distribution).

    The task can be reformulated as either a POMDP or an MDP by redefining the state as consisting of an observable part $S$ and the unobservable parameters of the Dirichlet distribution. The construction is called a superstate: $\tilde{S} = S \times \theta$

    Due to the complexity of the belief state, Bayesian RL is typically intractable in

    terms of both planning and updating the belief after an action. A recent approximate

    solution to Bayesian RL is the Bayesian exploration bonus. Lopes et al. (2012)

    Ross et al. (2011)
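    A minimal sketch of the Dirichlet bookkeeping behind this belief: by conjugacy, the posterior over each transition distribution stays Dirichlet, so updating the belief after an observed transition amounts to incrementing counts. The class and method names here are illustrative, not from the talk.

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Posterior over T(.|s,a) represented as Dirichlet(alpha0 + counts)."""
    def __init__(self, n_states, alpha0=1.0):
        self.n_states = n_states
        self.alpha0 = alpha0                                  # symmetric prior
        self.counts = defaultdict(lambda: [0.0] * n_states)   # keyed by (s, a)

    def update(self, s, a, s_next):
        # Conjugacy: observing s --a--> s_next just adds one to that count
        self.counts[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean estimate of T(s'|s,a)
        c = self.counts[(s, a)]
        total = sum(c) + self.alpha0 * self.n_states
        return [(ci + self.alpha0) / total for ci in c]
```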

  • 19

    Multi-agent Systems

    Stochastic games extend MDPs to multiple agents. The main difference between a standard MDP and its multi-player extension is that each agent independently chooses actions and receives rewards, while the state transition matrix is defined over the full joint action.

    Mac Dermed et al. (2011)
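    For concreteness, the standard definition implied here (not spelled out on the slide): an $n$-player stochastic game is a tuple

    $$ \langle S, A_1, \dots, A_n, T, R_1, \dots, R_n \rangle, \qquad T(s' \mid s, a_1, \dots, a_n), \qquad R_i : S \times A_1 \times \dots \times A_n \to \mathbb{R}, $$

    so both the transitions and each player's reward depend on the joint action; with $n = 1$ this reduces to an ordinary MDP.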

  • 20

    Multi-agent Systems

    How to solve stochastic games?

    Replace V(s) in Bellman’s equation with an achievable set function

    As a group of $n$ agents follows a joint policy, each player receives rewards. The discounted sum of these rewards is that player's utility. The joint utility is the vector of the players' utilities.

    An achievable set contains all possible joint utilities that players can receive using policies in equilibrium.

    Mac Dermed et al. (2011)
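    In symbols (a standard formulation, not written out on the slide): under a joint policy $\pi$, player $i$'s utility and the joint utility are

    $$ u_i(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_{i,t}\right], \qquad \vec{u}(\pi) = (u_1(\pi), \dots, u_n(\pi)), $$

    and the achievable set at a state collects the joint-utility vectors $\vec{u}(\pi)$ attainable by equilibrium joint policies; this set-valued object is what replaces the scalar $V(s)$ in Bellman's equation.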

  • 21

    References

    Nathaniel Daw. Advanced Reinforcement Learning. In Neuroeconomics (Chapter 16). Elsevier Inc. (2013)

    Darius Braziunas. POMDP Solution Methods. Tutorial, University of Toronto (2003)

    Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress. In NIPS Proceedings (2012)

    Stephane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann. A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes. Journal of Machine Learning Research 12 (2011) 1729-1770

    Liam Mac Dermed, Charles L. Isbell, Lora Weiss. Markov Games of Incomplete Information for Multi-Agent Reinforcement Learning. Interactive Decision Theory and Game Theory: Papers from the 2011 AAAI Workshop

  • THANK YOU FOR YOUR ATTENTION!

    22