
Reinforcement Learning
Michael Roberts

With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)

What is RL?

• Trial & error learning
  – without model
  – with model

• Structure

[Figure: a chain of states s1 → s2 → s3 → s4, with rewards r1, r2, r3 received on the transitions between them]

RL vs. Supervised Learning

• Evaluative vs. Instructional feedback

• Role of exploration

• On-line performance

K-armed Bandit Problem

[Figure: an agent selects among K actions and tracks each action's average reward]

Action | Observed rewards      | Average reward
a1     | 0, 0, 5, 10, 35       | 10
a2     | 5, 10, -15, -15, -10  | -5
a3     | (not shown)           | 100
a4     | (not shown)           | 0

K-armed Bandit Cont.

• Greedy exploration
• ε-greedy
• Softmax

Average reward:

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

Incremental formula:

$$Q_{k+1} = Q_k + \alpha \left( r_{k+1} - Q_k \right)$$

where: α = 1 / (k+1)

Probability of choosing action a (softmax, with temperature τ):

$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_b e^{Q(b)/\tau}}$$
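A minimal sketch of ε-greedy selection combined with the incremental average (the Gaussian reward simulator and the arm means are illustrative, not from the slides):

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=1000):
    """epsilon-greedy K-armed bandit with incremental averaging."""
    k = len(true_means)
    Q = [0.0] * k          # estimated average reward per action
    n = [0] * k            # pull counts per action
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit (greedy)
        r = random.gauss(true_means[a], 1.0)       # sample a noisy reward
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]  # incremental formula, alpha = 1/(k+1)
    return Q

# e.g. four arms with the average rewards from the table above
print(run_bandit([10, -5, 100, 0]))
```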

More General Problems

• More than one state
• Delayed rewards

• Markov Decision Process (MDP) (see the sketch after this list)
  – Set of states
  – Set of actions
  – Reward function
  – State transition function

• Table or Function Approximation
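One concrete way to write the four MDP components down as tables (a hypothetical two-state sketch in the spirit of the recycling robot below; the numbers are invented):

```python
# An MDP as plain tables: states, actions, and for each (state, action)
# a list of (probability, next_state, reward) outcomes.
states  = ["high", "low"]        # battery charge levels
actions = ["search", "wait"]

# transition function P and reward function R folded together:
# P[(s, a)] = [(prob, s', r), ...]  -- probabilities sum to 1
P = {
    ("high", "search"): [(0.7, "high", 5.0), (0.3, "low", 5.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("low",  "search"): [(0.6, "low",  5.0), (0.4, "high", -3.0)],
    ("low",  "wait"):   [(1.0, "low",  1.0)],
}
```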

Example: Recycling Robot

Recycling Robot: Transition Graph

Dynamic Programming

Backup Diagram

[Backup diagram: state s backs up over three actions (π(s,a) = .25 each shown), whose successor transition probabilities are .5/.5, .3/.7, .6/.4, with rewards 10, 5, 200, 200, -10, 1000]
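For reference, the backup such a diagram encodes is the Bellman equation for $V^{\pi}$, in Sutton & Barto's (1998) notation:

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]$$

Each successor's value is discounted by γ and weighted by the policy probability and the transition probability.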

Dynamic Programming: Optimal Policy

Backup for Optimal Policy
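The backup for the optimal policy replaces the expectation over actions with a max:

$$V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \right]$$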

Performance Metrics

• Eventual convergence to optimality

• Speed of convergence to optimality

• Regret

(Kaelbling, L., Littman, M., & Moore, A. 1996)
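For reference, regret (the third metric) is the standard measure of expected reward lost to not always acting optimally; after $T$ steps, with optimal expected per-step reward $\mu^{*}$:

$$\mathrm{Regret}(T) = T\mu^{*} - \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right]$$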

Gridworld Example

Initialize $V$ arbitrarily, e.g. $V(s) = 0$, for all $s \in S^{+}$

Repeat
    $\Delta \leftarrow 0$
    For each $s \in S$:
        $v \leftarrow V(s)$
        $V(s) \leftarrow \max_a \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V(s') \right]$
        $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)

Output a deterministic policy $\pi$ such that:
    $\pi(s) = \arg\max_a \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V(s') \right]$
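A direct transcription of that box into code, reusing the MDP tables sketched earlier (a minimal sketch; names and defaults are illustrative):

```python
def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Tabular value iteration; P[(s, a)] = [(prob, s', r), ...]."""
    V = {s: 0.0 for s in states}               # initialize V(s) = 0
    while True:                                # Repeat ... until
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(p * (r + gamma * V[s2])
                           for p, s2, r in P[(s, a)])
                       for a in actions)       # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # output a deterministic (greedy) policy
    pi = {s: max(actions,
                 key=lambda a: sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[(s, a)]))
          for s in states}
    return V, pi

# e.g., with the states/actions/P tables from the MDP sketch above:
# V, pi = value_iteration(states, actions, P)
```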

Temporal Difference Learning

• RL without a model
• Issue of temporal credit assignment
• Bootstraps like DP

• TD(0):

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

TD Learning

• Again, TD(0):

$$V(s_t) \leftarrow V(s_t) + \alpha \delta_t, \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

• TD(λ), applied to every state s:

$$V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s)$$

where e is called an eligibility trace, accumulated as $e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t]$
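A tabular sketch of online TD(λ) with accumulating traces (setting λ = 0 recovers TD(0)); the env interface (reset / step / sample_action) is an assumed stand-in, not from the slides:

```python
from collections import defaultdict

def td_lambda_episode(env, V, alpha=0.1, gamma=0.9, lam=0.8):
    """One episode of tabular TD(lambda) with accumulating traces.
    V: defaultdict(float) of state values.
    env (assumed interface): reset() -> s, step(a) -> (s', r, done)."""
    e = defaultdict(float)                  # eligibility traces e(s)
    s, done = env.reset(), False
    while not done:
        a = env.sample_action(s)            # behavior policy (assumed)
        s2, r, done = env.step(a)
        delta = r + gamma * V[s2] * (not done) - V[s]   # TD error
        e[s] += 1.0                         # bump trace for current state
        for x in e:                         # credit all recently visited states
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam             # decay traces
        s = s2
    return V
```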

Backup Diagram for TD(λ)

TD-Gammon (Tesauro)

Additional Work

• POMDPs

• Macros

• Multi-agent RL

• Multiple reward structures
