Bayesian Reinforcement Learning, Machine Learning RCC, 16th June 2011


Page 1:

Bayesian Reinforcement Learning

Machine Learning RCC

16th June 2011

Page 2:

Outline

• Introduction to Reinforcement Learning

• Overview of the field

• Model-based BRL

• Model-free RL

Page 3:

References

• ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel

• Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto

Page 4:

Machine Learning

Unsupervised Learning

Reinforcement Learning

Supervised Learning

Page 5:

Definitions

• State, Action, Reward

• Policy

• Reward function

Page 6:

Markov Decision Process

[Diagram: a trajectory x0 → a0 → x1 → a1 → …, with rewards r0, r1, …; actions are chosen by the policy, successor states by the transition probability, and rewards by the reward function]
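In symbols (a standard rendering of the diagram, using the slide's x, a, r notation):

```latex
a_t \sim \pi(\cdot \mid x_t), \qquad
x_{t+1} \sim P(\cdot \mid x_t, a_t), \qquad
r_t \sim R(\cdot \mid x_t, a_t)
```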

Page 7:

Value Function
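The slide's formula was lost in extraction; the standard discounted definition (as in Sutton & Barto) is:

```latex
V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, x_0 = x\right],
\qquad 0 \le \gamma < 1
```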

Page 8:

Optimal Policy

• Assume one optimal action per state

• The optimal value function is unknown; compute it with Value Iteration
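A minimal value-iteration sketch on a made-up two-state, two-action MDP (all transition probabilities, rewards, and the discount are hypothetical):

```python
# Value iteration on a toy 2-state, 2-action MDP (all numbers hypothetical).
# P[x][a][y] = probability of moving x -> y under action a; R[x][a] = reward.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.2, 0.8], [0.7, 0.3]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
gamma = 0.9
n_states, n_actions = 2, 2

V = [0.0, 0.0]
for _ in range(1000):
    # Bellman optimality backup: V(x) = max_a [ R(x,a) + gamma * sum_y P(y|x,a) V(y) ]
    Q = [[R[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(n_states))
          for a in range(n_actions)] for x in range(n_states)]
    V_new = [max(Q[x]) for x in range(n_states)]
    if max(abs(V_new[x] - V[x]) for x in range(n_states)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = [Q[x].index(max(Q[x])) for x in range(n_states)]  # greedy policy
```

Repeated Bellman backups are a contraction, so the loop converges to the optimal value function and the greedy policy with respect to it is optimal.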

Page 9:

Reinforcement Learning

• RL Problem: Solve MDP when reward/transition models are unknown

• Basic Idea: Use samples obtained from agent’s interaction with environment

Page 10:

Model-Based vs Model-Free RL

• Model-Based: Learn a model of the reward/transition dynamics and derive value function/policy

• Model-Free: Directly learn value function/policy

Page 11:

RL Solutions

Page 12:

RL Solutions

• Value Function Algorithms
  – Define a form for the value function
  – Sample a state-action-reward sequence
  – Update the value function
  – Extract the optimal policy

• SARSA, Q-learning
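Those four steps can be instantiated as a minimal tabular Q-learning loop; the three-state chain environment, step sizes, and episode count below are all hypothetical:

```python
import random

random.seed(0)

# Toy environment: state 0 -right-> 1 -right-> 2 (terminal, reward +1);
# action 0 ("left") stays put with reward 0.
def step(x, a):
    if a == 1 and x < 2:
        x += 1
    return x, (1.0 if x == 2 else 0.0), x == 2   # next state, reward, done

Q = {(x, a): 0.0 for x in range(3) for a in range(2)}   # value-function form
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                                    # sample sequences
    x, done = 0, False
    while not done:
        greedy = max((0, 1), key=lambda u: Q[(x, u)])
        a = random.randrange(2) if random.random() < eps else greedy
        x2, r, done = step(x, a)
        target = r + (0.0 if done else gamma * max(Q[(x2, 0)], Q[(x2, 1)]))
        Q[(x, a)] += alpha * (target - Q[(x, a)])       # update value function
        x = x2

policy = {x: max((0, 1), key=lambda u: Q[(x, u)]) for x in range(2)}  # extract policy
```

After training, the greedy policy moves right in both non-terminal states.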

Page 13:

RL Solutions

• Actor-Critic
  – Define a policy structure (actor)
  – Define a value function (critic)
  – Sample state-action-reward sequences
  – Update both actor and critic
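An actor-critic in its smallest form, as a sketch: a hypothetical one-step episodic task with a single start state and two actions, where action 1 pays +1 and action 0 pays 0.

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]            # actor: softmax preferences over the 2 actions
V = 0.0                       # critic: value estimate of the start state
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(prefs):
    z = [math.exp(p - max(prefs)) for p in prefs]
    s = sum(z)
    return [v / s for v in z]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1    # sample action from the actor
    r = 1.0 if a == 1 else 0.0                    # sample reward
    delta = r - V                                 # TD error (episode ends here)
    V += alpha_critic * delta                     # update the critic
    for b in range(2):                            # update the actor
        grad = (1.0 if b == a else 0.0) - probs[b]
        theta[b] += alpha_actor * delta * grad    # policy-gradient step
```

The critic's TD error drives both updates: it nudges the value estimate toward observed rewards and re-weights the actor toward actions that did better than expected.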

Page 14:

RL Solutions

• Policy Search Algorithms
  – Define a form for the policy
  – Sample a state-action-reward sequence
  – Update the policy

• PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
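A policy-search sketch in the PEGASUS spirit, on a made-up toy problem: every candidate policy is evaluated on the same fixed set of random scenarios (seeds), so the estimated return becomes a deterministic function of the policy parameter that can be searched directly.

```python
import random

# Toy 1-step task (hypothetical): the policy acts when an observation in
# [0, 1] exceeds a threshold; acting pays +1 only when obs > 0.7, else -1.
def episode_return(threshold, seed):
    rng = random.Random(seed)          # the scenario IS the random seed
    obs = rng.random()
    if obs > threshold:
        return 1.0 if obs > 0.7 else -1.0
    return 0.0

seeds = range(200)                     # fixed scenarios, shared by all candidates

def score(threshold):                  # deterministic policy evaluation
    return sum(episode_return(threshold, s) for s in seeds) / len(seeds)

# Search step: sweep candidate parameters, keep the best-scoring policy.
candidates = [i / 20 for i in range(21)]
best = max(candidates, key=score)
```

Fixing the scenarios removes evaluation noise, which is what makes direct search over the policy parameter well behaved.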

Page 15:

Online - Offline

• Offline
  – Use a simulator
  – Policy fixed for each ‘episode’
  – Updates made at the end of the episode

• Online
  – Directly interact with the environment
  – Learning happens step-by-step

Page 16:

Model-Free Solutions

1. Prediction: Estimate V(x) or Q(x,a)

2. Control: Extract policy
  • On-Policy
  • Off-Policy

Page 17:

Monte-Carlo Predictions

The driving-home example: predicted remaining cost of a journey, with the value at each state before and after a Monte-Carlo update.

State             Reward to next state   Value   Updated value
Leave car park    -13                    -90     -100
Get out of city   -15                    -83     -87
Motorway          -61                    -55     -72
Enter Cambridge   -11                    -11     -11

Each state's value is updated toward the actual return observed from that state (e.g. -13 - 15 - 61 - 11 = -100 from the car park).
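The Monte-Carlo update behind this example can be sketched as follows (alpha = 1 and no discounting, for clarity):

```python
# After the journey finishes, each state's value moves to the return
# actually observed from that state onward.
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
rewards = [-13, -15, -61, -11]                # cost incurred after each state
V = dict(zip(states, [-90, -83, -55, -11]))   # values before the update

G = 0.0
returns = {}
for s, r in zip(reversed(states), reversed(rewards)):
    G += r                                    # return (remaining cost) from s
    returns[s] = G

alpha = 1.0
for s in states:
    V[s] += alpha * (returns[s] - V[s])       # MC update toward observed return
# V now maps the four states to -100, -87, -72, -11
```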

Page 18:

Temporal Difference Predictions

The same journey, now with temporal-difference updates.

State             Reward to next state   Value   Updated value
Leave car park    -13                    -90     -96
Get out of city   -15                    -83     -70
Motorway          -61                    -55     -72
Enter Cambridge   -11                    -11     -11

Each value is updated toward the immediate reward plus the current estimate at the next state (e.g. -13 + (-83) = -96 for the car park).
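The TD(0) update behind this example, sketched with alpha = 1 and gamma = 1 for clarity:

```python
# Each value moves toward the immediate reward plus the next state's
# current estimate, so there is no waiting until the journey ends.
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
rewards = [-13, -15, -61, -11]
V = dict(zip(states, [-90, -83, -55, -11]))   # values before the update

alpha = 1.0
old = dict(V)                 # estimates as they stood before this episode
for i, s in enumerate(states):
    nxt = old[states[i + 1]] if i + 1 < len(states) else 0.0
    V[s] += alpha * (rewards[i] + nxt - V[s])   # TD(0) update
# V now maps the four states to -96, -70, -72, -11
```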

Page 19:

Advantages of TD

• Don’t need a model of reward/transitions

• Online, fully incremental

• Proven to converge under standard conditions on the step-size

• “Usually” faster than MC methods

Page 20:

From TD to TD(λ)

[Backup diagram: a chain of states and rewards ending at the terminal state]

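TD(λ) can be sketched with accumulating eligibility traces on the same journey example (gamma = 1; alpha and lambda are free parameters). Setting λ = 0 recovers the one-step TD update, while λ = 1 reproduces the Monte-Carlo update:

```python
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
rewards = [-13, -15, -61, -11]

def td_lambda(lam, alpha=1.0, gamma=1.0):
    V = dict(zip(states, [-90, -83, -55, -11]))
    e = {s: 0.0 for s in states}                  # eligibility traces
    for i, s in enumerate(states):
        nxt = V[states[i + 1]] if i + 1 < len(states) else 0.0
        delta = rewards[i] + gamma * nxt - V[s]   # one-step TD error
        e[s] += 1.0                               # mark current state eligible
        for u in states:
            V[u] += alpha * delta * e[u]          # credit all eligible states
            e[u] *= gamma * lam                   # traces decay
    return V

V_mid = td_lambda(0.8, alpha=0.5)   # something between TD(0) and Monte-Carlo
```

Intermediate λ interpolates: each TD error is shared backwards along the trajectory in proportion to how recently each state was visited.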

Page 22:

SARSA & Q-learning

Both are TD-learning algorithms:

• SARSA (on-policy): estimates the value function of the current policy

• Q-Learning (off-policy): estimates the value function of the optimal policy
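The two update rules side by side, as a sketch (alpha, gamma, and the sample Q-table below are hypothetical):

```python
alpha, gamma = 0.1, 0.9

def sarsa_update(Q, x, a, r, x2, a2):
    # On-policy: bootstrap on the action a2 the current policy actually took.
    return Q[(x, a)] + alpha * (r + gamma * Q[(x2, a2)] - Q[(x, a)])

def q_learning_update(Q, x, a, r, x2, actions):
    # Off-policy: bootstrap on the greedy action, whatever the policy did.
    best = max(Q[(x2, b)] for b in actions)
    return Q[(x, a)] + alpha * (r + gamma * best - Q[(x, a)])

Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 1.0}
# When the policy's next action a2 = 1 is not greedy, the two rules differ:
# sarsa_update(Q, 0, 0, 0.5, 1, 1)            ~ 0.14
# q_learning_update(Q, 0, 0, 0.5, 1, [0, 1])  ~ 0.23
```

The single point of difference (actual next action vs. greedy next action) is exactly the on-policy/off-policy distinction in the table above.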

Page 23:

GP Temporal Difference

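The equations on this slide did not survive the transcript. For orientation, a sketch of the standard GPTD formulation of Engel, Mannor and Meir (the Y. Engel line of work cited on the references slide), which may differ from the slide's exact notation: place a Gaussian-process prior on the value function and treat each observed reward as a noisy difference of consecutive values,

```latex
r(x_t) = V(x_t) - \gamma\, V(x_{t+1}) + N(x_t, x_{t+1}),
\qquad V \sim \mathcal{GP}\big(0,\; k(\cdot,\cdot)\big)
```

Stacking the transitions of a trajectory gives a linear-Gaussian model $R = HV + N$, with $H$ bidiagonal (ones on the diagonal, $-\gamma$ above it), so conditioning on the observed rewards yields a closed-form Gaussian posterior over the value function.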
