Bayesian Reinforcement Learning

Rowan McAllister and Karolina Dziugaite

MLG RCC

21 March 2013

Outline

1 Introduction: Bayesian Reinforcement Learning; Motivating Problem

2 Planning in MDP Environments: Markov Decision Process; Action-Values

3 Reinforcement Learning: Q-learning

4 Bayesian Reinforcement Learning (Model-Based): Bayesian RL as a POMDP; Bayesian Inference on Beliefs; Value Optimisation

Bayesian Reinforcement Learning - what is it?

Bayesian RL is about capturing and dealing with uncertainty, where ‘classic RL’ does not. Research in Bayesian RL includes modelling the transition function, value function, policy, or reward function probabilistically.

Differences over ‘classic RL’:

Resolves exploitation & exploration dilemma by planning in belief space.

Computationally intractable in general, but approximations exist.

Uses and chooses samples to learn from efficiently, suitable when sample cost is high, e.g. robot motion.

Many slides use ideas from Goel’s MS&E235 lecture, Poupart’s ICML 2007 tutorial, and Littman’s MLSS ‘09 slides.

Motivating Problem: Two armed bandit (1)

You have n tokens, which may be used in one of two slot machines.

The i-th machine returns $0 or $1 based on a fixed yet unknown probability p_i ∈ [0, 1].

Objective: maximise your winnings.

Motivating Problem: Two armed bandit (2)

As you play, you record what you see, formatted as (#wins, #losses). Your current record shows:
Arm 1: (1, 2)
Arm 2: (21, 19)

Which machine would you play next if you have 1 token remaining?

How about if you have 100 tokens remaining?

Motivating Problem: Two armed bandit (3)

‘Classic’ Reinforcement Learning mentality: two action classes (not mutually exclusive):

Exploit

Select the action of greatest expected return given the current belief on reward probabilities, i.e. select the best action according to the best guess of the underlying MDP: MLE or MAP → select Arm #2!

Explore

Select a random action to increase our certainty about the underlying MDP. This may lead to higher returns when exploiting in the future.

→ Dilemma (?): how to choose between exploitation and exploration? Seems like comparing apples and oranges... Many heuristics exist, but is there a principled approach?

Motivating Problem: Two armed bandit (4)

Steps towards resolving ‘exploitation’ vs ‘exploration’: model future beliefs in Arm 1 (#wins, #losses):

[Figure: tree of potential future beliefs for Arm 1, starting from (1,2). A win (probability 1/3) leads to (2,2) and a loss (probability 2/3) leads to (1,3); from (2,2), a win or loss (probability 1/2 each) leads to (3,2) or (2,3). The (2,2) branch is marked as having a higher expectation of rewards in that potential future.]

We can plan in this space, and compute the expected additional rewards gained from exploratory actions.

Note: the value of exploration depends on how much we can exploit that information gain later, i.e. the # of tokens remaining. Alternatively, with infinite tokens and discount rate γ on future rewards, the effective horizon is ∝ −1/log(γ).
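To make this lookahead concrete, here is a minimal sketch (not from the slides) that treats the raw (#wins, #losses) counts as Beta parameters, matching the 1/3 and 1/2 edge probabilities in the tree above, and plans over the remaining tokens by enumerating possible futures:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(beliefs, tokens):
    """Expected total winnings with `tokens` pulls left, acting optimally
    under the current beliefs; beliefs = ((wins1, losses1), (wins2, losses2))."""
    if tokens == 0:
        return 0.0
    best = 0.0
    for i, (w, l) in enumerate(beliefs):
        p = w / (w + l)  # posterior-mean win probability of arm i
        win = tuple((w + 1, l) if j == i else b for j, b in enumerate(beliefs))
        lose = tuple((w, l + 1) if j == i else b for j, b in enumerate(beliefs))
        best = max(best,
                   p * (1.0 + value(win, tokens - 1))
                   + (1 - p) * value(lose, tokens - 1))
    return best

beliefs = ((1, 2), (21, 19))
# With 1 token left the myopic choice (Arm 2) is best; with more tokens left,
# the value of the information gained by trying Arm 1 enters the computation.
print(value(beliefs, 1), value(beliefs, 20))
```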

Planning in MDP Environments

Planning overview

[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a state and reward.]

Environment: a familiar MDP (we can simulate interaction with the world accurately).

Goal: compute a policy that maximises expected long-term discounted rewards over a horizon (episodic or continual).

Markov Decision Process

S, set of states s

A, set of actions a

π : S → A, the policy, a mapping from state s to action a

System dynamics

T(s, a, s′) = P(s′ | s, a) ∈ [0, 1], the transition probability that state s′ is reached by executing action a from state s

R(s, a, s′) ∈ ℝ, a reward distribution; an agent receives a reward drawn from this when taking action a from state s and reaching state s′
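As a concrete illustration of this definition, a small tabular MDP can be stored as plain dictionaries; the toy states, actions and numbers below are made up for illustration, not taken from the slides:

```python
# Toy tabular MDP (illustrative): T[(s, a)] maps successor states s' to
# P(s'|s, a), and R[(s, a)] is the expected immediate reward for (s, a).
S = ["s0", "s1"]
A = ["stay", "go"]

T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

# A deterministic policy is then simply a mapping from states to actions.
pi = {"s0": "go", "s1": "stay"}
```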

Robot planning example

Goal: traverse to a human-specified goal location ‘safely’
States: physical space (x, y, yaw)
Actions: move forward, left, spin anticlockwise, etc.
Rewards: ‘dangerousness’ of each (s, a) motion primitive

[Figure: (a) Path Planning Scenario, (b) Rewards, (c) Policy]

Rewards

A measure of the desirability of the agent being in a particular state. Used to encode what we want the agent to achieve, not how.

Example: an agent learns to play chess. Don’t reward the agent for capturing the opponent’s queen, only for winning (we don’t want the agent discovering novel ways to capture queens at the expense of losing games!).

Caveat: reward shaping, the modification of the reward function to give partial credit without affecting the optimal policy (much), can be important in practice.

Optimal Action-Value Function

Optimal action value: the expectation of all future discounted rewards from taking action a in state s, assuming subsequent actions are chosen by the optimal policy π^*. It can be re-expressed as a recursive relationship.

Q^*(s, a) = E_{π^*}[ ∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a ]

          = E_{π^*}[ R(s_0, a_0) + γ ∑_{t=0}^∞ γ^t R(s_{t+1}, a_{t+1}) | s_0 = s, a_0 = a ]

          = R(s, a) + γ E_{π^*}[ max_{a'} Q^*(s_1, a') | s_0 = s, a_0 = a ]

          = R(s, a) + γ ∑_{s'} T(s, a, s') [ max_{a'} Q^*(s', a') ]

Action-Value Optimisation

Need to satisfy the Bellman Optimality Equation:

Q^*(s, a) = R(s, a) + γ ∑_{s'} T(s, a, s') max_{a'} Q^*(s', a')

π^*(s) = argmax_a Q^*(s, a)

An algorithm to compute Q^*(s, a) is value iteration: for all s ∈ S and a ∈ A, repeat until convergence:

Q_{t+1}(s, a) ← R(s, a) + γ ∑_{s'} T(s, a, s') max_{a'} Q_t(s', a')
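A minimal sketch of this value-iteration loop over action-values, assuming the dictionary representation of T and R sketched after the Markov Decision Process slide (illustrative, not the slides' own code):

```python
def value_iteration(S, A, T, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration on action-values Q(s, a).
    T[(s, a)] is a dict {s': P(s'|s, a)}; R[(s, a)] is the expected reward."""
    Q = {(s, a): 0.0 for s in S for a in A}
    while True:
        delta = 0.0
        for s in S:
            for a in A:
                backup = R[(s, a)] + gamma * sum(
                    p * max(Q[(s2, a2)] for a2 in A)
                    for s2, p in T[(s, a)].items())
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            break
    pi = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
    return Q, pi

# e.g. Q, pi = value_iteration(S, A, T, R) with the toy MDP sketched earlier.
```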

Reinforcement Learning

Reinforcement learning overview

[Figure: agent-environment loop, as before. The agent sends an action; the environment returns a state and reward.]

Environment: an unfamiliar MDP (T(s, a, s′) and/or R(s, a) unknown) and possibly dynamic / changing.

Consequence: the agent cannot simulate interaction with the world in advance to predict future outcomes. Instead, the optimal policy is learned through sequential interaction and evaluative feedback.

Goal: same as planning (compute a policy that maximises expected long-term discounted rewards over a horizon).

Q-learning

With known environment models R(s, a) and T(s, a, s′), the Q-values are computed iteratively using value iteration (as in planning):

Q_{t+1}(s, a) ← R(s, a) + γ ∑_{s'} T(s, a, s') max_{a'} Q_t(s', a')

Q-learning

With unknown environment models, the Q-values are computed as point estimates. On experience {s_t, a_t, r_{t+1}, s_{t+1}}:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t ( r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) )

If every {s, a} is visited infinitely often, ∑_t α_t = ∞ and ∑_t α_t² < ∞, then Q converges to Q^* (independently of the policy being followed!).
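A minimal sketch of the tabular Q-learning update above; the `env.reset()`/`env.step()` interface is an assumption in the style of common RL toolkits, not something specified in the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.05):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)                       # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:            # occasionally explore
                a = random.choice(actions)
            else:                                # otherwise exploit estimates
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # point-estimate update
            s = s2
    return Q
```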

When to explore?: Heuristic approach to action selection

A couple of heuristic examples agents use for action selection, to mostly exploit and sometimes explore:

ε-greedy:

π(a|s) = 1 − ε   if a = argmax_a Q_t(s, a)
         ε/|A|   if a ≠ argmax_a Q_t(s, a)

e.g. ε = 5%

Softmax:

π(a|s) = e^{Q_t(s,a)/τ} / ∑_i e^{Q_t(s,i)/τ}

i.e. biased towards more fruitful actions; τ is a temperature knob, with larger τ giving more frequent exploration.

Note: often, heuristics are too inefficient for online learning! We wish to minimise wasteful exploration.
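A small sketch of both selection rules, assuming a tabular Q indexed by (s, a) as in the Q-learning sketch (illustrative):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.05):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_action(Q, s, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    prefs = [Q[(s, a)] / tau for a in actions]
    m = max(prefs)                               # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```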

Bayesian Reinforcement Learning (Model-Based)

Brief Description

Start with a prior over transition probabilities T(s, a, s′), and maintain the posterior (update it) as evidence comes in.

Now we can reason about more/less likely MDPs, instead of just possible MDPs or a single ‘best guess’ MDP.

Can plan in the space of posteriors to:
- evaluate the likelihood of any possible outcome of an action.
- model how that outcome will change the posterior.

Motivation (1)

Resolves ‘classic’ RL dilemma:

maximise immediate rewards (exploit), or

maximise info gain (explore)?

Wrong question!

→ Single objective: maximise expected rewards up to the horizon (as a weighted average over the possible futures). This implicitly trades off exploration and exploitation optimally.

Motivation (2)

More Pros:

Prior information is easily used; we can start planning straight away by running a full backup.

Easy to explicitly encode prior knowledge / domain assumptions.

Easy to update the belief as we collect evidence, if using conjugate priors.

Cons:

Computationally intractable except in special cases (bandits, short horizons).

Bayesian RL as a POMDP (1)

Let θ_{sas′} denote the unknown MDP parameter T(s, a, s′) = P(s′|s, a) ∈ [0, 1], and let b(θ) be the agent’s prior belief over all unknown parameters θ_{sas′}.

[Duff 2002]: Define the hybrid state S_p = S (certain) × θ_{sas′} (uncertain). Cast Bayesian RL as a Partially Observable Markov Decision Process (POMDP) P = ⟨S_p, A_p, O_p, T_p, Z_p, R_p, γ, b⁰_p⟩.

Use your favourite POMDP solution technique. This provides a Bayes-optimal policy in our original state space.

Bayesian RL as a POMDP (2)

S_p = S × θ, hybrid states of known S and all unknown θ_{sas′}

A_p = A, original action set (unchanged)

O_p = S, observation space

T_p(s, θ_{sas′}, a, s′, θ′_{sas′}) = P(s′, θ′_{sas′} | s, θ_{sas′}, a)
                                  = P(θ′_{sas′} | θ_{sas′}) P(s′ | s, θ_{sas′}, a)
                                  = δ(θ′_{sas′} − θ_{sas′}) θ_{sas′},   assuming θ_{sas′} is stationary

R_p(s, θ_{sas′}, a, s′, θ′_{sas′}) = R(s, a, s′)

Z_p(s′, θ′_{sas′}, a, o) = P(o | s′, θ′_{sas′}, a) = δ(o − s′),   as the observation is s′

T(·): transition probability (known), R(·): reward distribution, Z(·): observation function

Bayesian Inference

Let b(θ) be the agent’s current (prior) belief over all unknown parameters θ_{sas′}. For each observed transition {s, a, s′}, the belief is updated accordingly:

b_{sas′}(θ) ∝ b(θ) P(s′ | θ_{sas′}, s, a)
           = b(θ) θ_{sas′}

(posterior) ∝ (prior) × (likelihood)

Common Prior: Dirichlet Distribution

Dir(θ_{sa}; n_{sa}) = (1 / B(n_{sa})) ∏_{s′} (θ_{sas′})^{n_{sas′} − 1}

Suitable for discrete state spaces. The Dirichlet distribution is a conjugate prior to a multinomial likelihood (counts of how often taking a from s reached s′), so Bayes updates have an easy closed form.

Bayesian Inference: Discrete MDPs

b_{sas′}(θ) ∝ b(θ) θ_{sas′}

For discrete MDPs, we can define θ_{sa} = P(· | s, a) as a multinomial.

Choosing the prior b(θ) to be a product of Dirichlets, ∏_{s,a} Dir(θ_{sa}; n_{sa}) ∝ ∏_{s,a} ∏_{s′} (θ_{sas′})^{n_{sas′} − 1}, the posterior / updated belief retains the same form:

b_{sas′}(θ) ∝ ( ∏_{s,a} Dir(θ_{sa}; n_{sa}) ) θ_{sas′}
           ∝ ∏_{s,a} Dir(θ_{sa}; n_{sa} + δ(s, a, s′))        (1)

(where n_{sa} is a vector of hyperparameters n_{sas′}, the number of observed {s, a, s′} transitions)

→ So the belief is updated simply by incrementing the corresponding count n_{sas′}.
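A minimal sketch of this belief representation and update, keeping one vector of Dirichlet counts per (s, a) pair; the class and method names are illustrative, not from the slides:

```python
from collections import defaultdict

class DirichletTransitionBelief:
    """Belief over discrete MDP transition parameters θ_{sas'} as
    independent Dirichlet counts per (s, a) pair."""

    def __init__(self, prior_count=1.0):
        # counts[(s, a)][s'] holds the Dirichlet hyperparameter n_{sas'}
        self.counts = defaultdict(lambda: defaultdict(lambda: prior_count))

    def update(self, s, a, s_next):
        """Bayes update for one observed transition: increment n_{sas'}."""
        self.counts[(s, a)][s_next] += 1.0

    def mean(self, s, a, states):
        """Posterior-mean transition distribution E[θ_{sa}] over `states`."""
        n = {s2: self.counts[(s, a)][s2] for s2 in states}
        total = sum(n.values())
        return {s2: v / total for s2, v in n.items()}
```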

Factoring structural priors

Can the transition dynamics be jointly expressed as a function of a smaller number of parameters?

Parameter tying is a special case of knowing θ_{sas′} = θ_{s̄as̄′} (the same action behaving the same way from different states).

- realistic: real-life action outcomes from one state often generalise
- useful: speeds up convergence / fewer trials required → mitigates expensive hardware collisions etc.

Factoring structural priors: Example (1)

Taxi example: [Dietterich 1998]

Goal: pick up passenger and drop at destination.

States: 25 taxi locations × 4 pickup locations × 4 dropoff destinations

Actions: N, S, E, W, pickup, dropoff

Rewards: +20 for successful delivery of the passenger, −10 for an illegal pickup or dropoff, −1 otherwise.

#θ_{sa} = |S| × |A| = 400 × 6 = 2400.

[Figure: 5 × 5 taxi grid with the possible pickup/dropoff locations R, Y, G, B marked.]

Factoring structural priors: Example (2)

We can factor θ_{sa}: we know a priori that navigation to the pickup location is independent of the dropoff destination! Furthermore, the navigation task is independent of its purpose (pickup or dropoff).

[Figure: task hierarchy with navigation as a subroutine: Root → Get / Put; Get → Pickup, Navigate(t = source); Put → Putdown, Navigate(t = destination); Navigate(t) → North, South, East, West.]

→ # states required to learn navigation: 25 × 4 = 100 < 400. Using a factored DBN model to generalise transitions across multiple states, we quarter the number of θ_{sa} to learn.
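A small sketch of what parameter tying can look like on top of the count-based belief above: tied (s, a) pairs share a single count vector via a grouping function. The grouping key and the relative-outcome encoding are assumptions made for illustration, not details from the slides:

```python
from collections import defaultdict

class TiedDirichletBelief:
    """Dirichlet transition belief with parameter tying: all (s, a) pairs
    that map to the same group key share one count vector."""

    def __init__(self, group_of, prior_count=1.0):
        self.group_of = group_of          # e.g. lambda s, a: (s.taxi_pos, a)
        self.counts = defaultdict(lambda: defaultdict(lambda: prior_count))

    def update(self, s, a, outcome):
        # `outcome` should be encoded relative to s (e.g. "moved north"),
        # so experience from one state generalises to its tied states.
        self.counts[self.group_of(s, a)][outcome] += 1.0

    def mean(self, s, a, outcomes):
        n = {o: self.counts[self.group_of(s, a)][o] for o in outcomes}
        total = sum(n.values())
        return {o: v / total for o, v in n.items()}
```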

Value Optimisation

Classic RL Bellman equation:

Q^*(s, a) = ∑_{s′} P(s′ | s, a) [ R(s, a, s′) + γ max_{a′} Q^*(s′, a′) ]

POMDP Bellman equation, in the BRL context:

Q^*(s, b, a) = ∑_{s′} P(s′ | s, b, a) [ R(s, a, s′) + γ max_{a′} Q^*(s′, b_{sas′}, a′) ]

The Bayes-optimal policy is π^*(s, b) = argmax_a Q^*(s, b, a), which maximises the predicted reward up to the horizon, over a weighted average of all the possible futures.

Big Picture

Task: solve

Q^*(s, b, a) = ∑_{s′} P(s′ | s, b, a) [ R(s, a, s′) + γ max_{a′} Q^*(s′, b_{sas′}, a′) ]

Challenge: the size of the s × b_{sas′} space grows exponentially with the number of θ_{sas′} parameters → the Bayes-optimal solution is intractable.

Solutions: approximate Q∗(s, b, a) via:

discretisation

exploration bonuses [BEB, Kolter 2009]

myopic value of info [Bayesian Q-learning, Dearden 1999]

sample beliefs [Bayesian Forward Search Sparse Sampling, Littman 2012]

sample MDPs, update occasionally [Thompson Sampling, Strens 2000]
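As a concrete example of the last item, a minimal sketch of model-based Thompson sampling [Strens 2000]: sample an MDP from the Dirichlet posterior counts, plan in it, act greedily for an episode, then resample. The `plan` argument and the `env` interface are assumptions for the sketch (e.g. the value-iteration function shown earlier), and rewards are taken as known, consistent with modelling only transition uncertainty:

```python
import random
from collections import defaultdict

def sample_mdp(counts, S, A):
    """Draw one transition model from the Dirichlet posterior counts
    counts[(s, a)][s'] (normalised Gamma draws give a Dirichlet sample)."""
    T = {}
    for s in S:
        for a in A:
            draws = {s2: random.gammavariate(counts[(s, a)][s2], 1.0) for s2 in S}
            total = sum(draws.values())
            T[(s, a)] = {s2: d / total for s2, d in draws.items()}
    return T

def thompson_sampling(env, S, A, R, plan, episodes=100, gamma=0.95, prior=1.0):
    """At the start of each episode: sample an MDP from the posterior, plan in
    it with `plan`, act greedily in the sampled MDP, and update the counts.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    counts = defaultdict(lambda: defaultdict(lambda: float(prior)))
    for _ in range(episodes):
        T = sample_mdp(counts, S, A)
        _, pi = plan(S, A, T, R, gamma)
        s, done = env.reset(), False
        while not done:
            a = pi[s]
            s2, r, done = env.step(a)
            counts[(s, a)][s2] += 1.0        # conjugate Dirichlet count update
            s = s2
    return counts
```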

Algorithm: BEETLE (1)

[Poupart et al. 2006]

Exploits the piecewise-linear and convex property of the POMDP value function [Sondik 1971].

Samples a set of reachable (s, b) pairs by simulating a random policy.

Uses Point Based Value Iteration (PBVI) [Pineau 2003] to approximate value iteration, by tracking the value + derivative at the sampled belief points.

Proves that the α-functions (one per sampled belief) in Bayesian RL are a set of multivariate polynomials of θ_{sas′}, with V^*_s(θ) = max_i poly_i(θ).

Scalable: has a closed-form value representation under Bellman backups.

[Figure: POMDP value function representation using PBVI (left) versus a grid (right); V = {α0, α1, α2}. Each α is an |S|-dimensional hyperplane, defining the value function over a bounded region of belief space [Pineau 2003].]

Algorithm: BEETLE (2)

[Figure: the five-state ‘Chain’ problem [Strens 2002]. Action a advances along the chain with reward 0, yielding reward 10 at the final state; action b returns to the start with reward 2.]

Table 1. Expected total reward for the chain and handwashing problems (na-m indicates insufficient memory) [Poupart 2006]:

problem     | |S| | |A| | free params | optimal (utopic) | discrete POMDP | exploit    | Beetle     | Beetle precomputation (min) | Beetle optimization (min)
chain tied  | 5   | 2   | 1           | 3677             | 3661 ± 27      | 3642 ± 43  | 3650 ± 41  | 0.4                         | 1.5
chain semi  | 5   | 2   | 2           | 3677             | 3651 ± 32      | 3257 ± 124 | 3648 ± 41  | 1.3                         | 1.3
chain full  | 5   | 2   | 40          | 3677             | na-m           | 3078 ± 49  | 1754 ± 42  | 14.8                        | 18.0
handw tied  | 9   | 2   | 4           | 1153             | 1149 ± 12      | 1133 ± 12  | 1146 ± 12  | 2.6                         | 11.8
handw semi  | 9   | 2   | 8           | 1153             | 990 ± 8        | 991 ± 31   | 1082 ± 17  | 3.4                         | 52.3
handw full  | 9   | 6   | 270         | 1083             | na-m           | 297 ± 10   | 385 ± 10   | 125.3                       | 8.3

PAC-MDP and Bayesian RL algorithms

Model Based Interval Estimation with Exploration Bonus (MBIE-EB, PAC-MDP) and Bayesian Exploration Bonus (BEB, Bayesian RL) will be compared. Both:

count how many times each transition (s, a, s′) has happened: α(s, a, s′);

use the counts to produce an estimate of the underlying MDP;

add an exploration bonus to the reward for an (s, a) pair if it has not been visited enough;

act greedily with respect to this modified MDP.

Bellman’s optimality equations with exploration bonus

Let α₀(s, a) = ∑_{s′} α(s, a, s′) and b = {α(s, a, s′)}. Then

P(s′ | b, s, a) = α(s, a, s′) / α₀(s, a)

Attempts to maximize:

BEB

V^*_H(b, s) = max_a { R(s, a) + β / (1 + α₀(s, a)) + ∑_{s′} P(s′ | b, s, a) V^*_{H−1}(s′) }

MBIE-EB

V^*_H(s) = max_a { R(s, a) + β / √(α₀(s, a)) + ∑_{s′} P(s′ | b, s, a) V^*_{H−1}(b, s′) }
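A minimal sketch of a finite-horizon backup on the count-based MDP estimate with a pluggable exploration bonus, so either rule above can be dropped in; it follows the simpler form in which the next-step value does not carry an updated belief (an assumption for the sketch), and takes R(s, a) and the counts as given:

```python
def bonus_backup(S, A, R, alpha, H, bonus):
    """Finite-horizon value iteration on the count-based MDP estimate, with an
    exploration bonus added to the reward of each (s, a).

    alpha[(s, a)][s'] are transition counts (assumed to include a positive
    prior count, so that α₀(s, a) > 0); `bonus` maps α₀(s, a) to the bonus, e.g.
        BEB:     lambda n0: beta / (1.0 + n0)
        MBIE-EB: lambda n0: beta / n0 ** 0.5
    """
    V = {s: 0.0 for s in S}                          # V_0 = 0
    for _ in range(H):
        V_next = {}
        for s in S:
            best = float("-inf")
            for a in A:
                n0 = sum(alpha[(s, a)].values())
                P = {s2: c / n0 for s2, c in alpha[(s, a)].items()}
                q = R[(s, a)] + bonus(n0) + sum(p * V[s2] for s2, p in P.items())
                best = max(best, q)
            V_next[s] = best
        V = V_next
    return V                                         # approximate V*_H under the bonus
```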

Near Bayesian optimal

Approximate Bayes-optimal: if A_t denotes the policy followed by the algorithm at time t, then with probability greater than 1 − δ,

V^{A_t}(b_t, s_t) ≥ V^*(b_t, s_t) − ε

where V^*(b, s) is the value function of a Bayes-optimal strategy.

Near Bayes-optimal: with probability ≥ 1 − δ, the agent follows an approximately Bayes-optimal policy for all but a “small” number of steps, which is polynomial in quantities representing the system.

BEB near Bayesian optimality

Theorem (Kolter and Ng, 2009)

Let A_t denote the policy followed by the BEB algorithm (with β = 2H²) at time t, and let s_t and b_t be the corresponding state and belief. Also suppose we stop updating the belief for a state-action pair when α₀(s, a) > 4H³/ε. Then with probability at least 1 − δ,

V^{A_t}_H(b_t, s_t) ≥ V^*_H(b_t, s_t) − ε

i.e., the algorithm is ε-close to the optimal Bayesian policy for all but

m = O( (|S| |A| H⁶ / ε²) · log(|S| |A| / δ) )

time steps.

PAC-MDP

Theorem (Strehl, Li and Littman 2006)

Let A_t denote the policy followed by some algorithm. Also, let the algorithm satisfy the following properties, for some input ε:

acts greedily at every time step t;

is optimistic (V_t(s) ≥ V^*_t(s) − ε);

has bounded learning complexity (a bounded number of action-value estimate updates and of escape events);

is accurate (V_t(s) − V^{π_t}_{M_{K_t}}(s) ≤ ε).

Then, with probability greater than 1 − δ, for all but

O( |S|² |A| H⁶ / ε² )

time steps, the algorithm follows a 4ε-optimal policy.

Rate of Decay

Theorem (Kolter and Ng, 2009)

Let A_t denote the policy followed by an algorithm using any (arbitrarily complex) exploration bonus that is upper bounded by

β / α₀(s, a)^p

for some constant β and p > 1/2. Then there exist some MDP M and ε₀(β, p) such that, with probability greater than δ₀ = 0.15,

V^{A_t}_H(s_t) < V^*_H(s_t) − ε₀

will hold for an unbounded number of steps.

The proof uses the following inequality.

Lemma (Slud’s inequality)

Let X_1, ..., X_n be i.i.d. Bernoulli random variables with mean µ > 3/4. Then

P( µ − (1/n) ∑_{i=1}^n X_i > ε ) ≥ 1 − Φ( ε √n / √(µ(1 − µ)) )

Proof

We lower-bound the probability that the algorithm’s estimate of the reward for playing a₁, plus the exploration bonus f(n), is pessimistic by at least β/n^p:

P( 3/4 − (1/n) ∑_{i=1}^n r_i − f(n) ≥ β/n^p )

≥ P( 3/4 − (1/n) ∑_{i=1}^n r_i ≥ 2β/n^p )

≥ 1 − Φ( 8β / (√3 · n^{p−1/2}) )

Proof

Set

n ≥ (8β/√3)^{2/(2p−1)}

and

ε₀(β, p) = β / (8β/√3)^{2p/(2p−1)}

So at stage n, with probability at least 0.15, action a₂ will be preferred over a₁ and the agent will stop exploring ⇒ the algorithm will be more than ε suboptimal for an infinite number of steps, for any ε ≥ ε₀.

Conclusions

Both algorithms use the same intuition: in order to perform well, we want to explore enough that we learn an accurate model of the system;

For PAC-MDP, the exploration bonus cannot shrink at a rate faster than α₀(s, a)^{−1/2} (exponent 1/2) or the algorithm fails to be near optimal, and a slow rate of decay results in more exploration;

BEB reduces the amount of exploration needed, which allows us to achieve lower sample complexity and use a greedier exploration method;

A near Bayesian-optimal policy is not necessarily near-optimal: the optimality is considered with respect to the Bayesian policy, rather than the optimal policy for some fixed MDP.
