![Page 1: CS 182/CogSci110/Ling109 Spring 2008 Reinforcement Learning: Algorithms 4/1/2008 Srini Narayanan – ICSI and UC Berkeley](https://reader038.vdocuments.site/reader038/viewer/2022110322/56649d3e5503460f94a17d4d/html5/thumbnails/1.jpg)
CS 182/CogSci110/Ling109 Spring 2008
Reinforcement Learning: Algorithms
4/1/2008
Srini Narayanan – ICSI and UC Berkeley
Lecture Outline
- Introduction
- Basic Concepts
  - Expectation, Utility, MEU
  - Neural correlates of reward based learning
  - Utility theory from economics
    - Preferences, Utilities.
- Reinforcement Learning: AI approach
  - The problem
  - Computing total expected value with discounting
  - Q-values, Bellman’s equation
  - TD-Learning
Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent’s utility is defined by the reward function
  - Must learn to act so as to maximize expected utility
  - Change the rewards, change the behavior
DEMO
Elements of RL
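- Transition model, how action influences states
- Reward R, immediate value of state-action transition
- Policy π, maps states to actions

[Figure: agent-environment loop. The agent's policy maps the current state to an action; the environment returns the next state and a reward.]

$s_0 \xrightarrow[r_0]{a_0} s_1 \xrightarrow[r_1]{a_1} s_2 \xrightarrow[r_2]{a_2} \cdots$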
Markov Decision Processes
- Markov decision processes (MDPs)
  - A set of states s ∈ S
  - A model T(s,a,s’) = P(s’ | s,a)
    - Probability that action a in state s leads to s’
  - A reward function R(s, a, s’) (sometimes just R(s) for leaving a state or R(s’) for entering one)
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are the simplest case of reinforcement learning
- In general reinforcement learning, we don’t know the model or the reward function
Elements of RL
r(state, action): immediate reward values

[Figure: grid world with goal state G. Two transitions are labeled with reward 100; all other transitions are labeled 0.]
Reward Sequences
- In order to formalize optimality of a policy, need to understand utilities of reward sequences
- Typically consider stationary preferences: if I prefer one state sequence starting today, I would prefer the same starting tomorrow.
- Theorem: only two ways to define stationary utilities
  - Additive utility: $U([s_0, s_1, s_2, \ldots]) = R(s_0) + R(s_1) + R(s_2) + \cdots$
  - Discounted utility: $U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^{2} R(s_2) + \cdots$
Elements of RL
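- Value function: maps states to state values

$V^{\pi}(s) = r(t) + \gamma\, r(t+1) + \gamma^{2}\, r(t+2) + \dots$

- Discount factor γ ∈ [0, 1) (here 0.9)

[Figure: the grid world shown twice, once with r(state, action) immediate reward values (100 on two transitions, 0 elsewhere) and once with V*(state) values.]

V*(state) values, top row: 90, 100, 0 (G)
V*(state) values, bottom row: 81, 90, 100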
RL task (restated)
- Execute actions in environment, observe results.
- Learn action policy π : state → action that maximizes expected discounted reward

$E[\,r(t) + \gamma\, r(t+1) + \gamma^{2}\, r(t+2) + \dots\,]$

from any starting state in S
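The quantity inside the expectation can be illustrated for one observed reward sequence. The following is a minimal sketch, assuming a finite list of rewards; the function name `discounted_return` is made up for this example and does not come from the lecture.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Two zero-reward steps followed by a reward of 100, with gamma = 0.9.
    print(discounted_return([0, 0, 100], gamma=0.9))
```

With a discount factor of 0.9, a reward of 100 that arrives two steps in the future is worth 0.9 × 0.9 × 100 = 81 now, which matches the 81 shown in the grid-world value figure.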
Hyperbolic discounting
Ainslie 1992
- Short term rewards are different from long term rewards
- Used in many animal discounting models
- Has been used to explain
  - procrastination
  - addiction
- Evidence from Neuroscience (Next lecture)
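One way to see why short-term and long-term rewards are treated differently is to compare the two discount curves directly. The sketch below assumes the common hyperbolic form amount / (1 + k·delay), which the slide does not spell out; the rate `k`, the reward sizes (50 and 100) and the 5-step gap are all illustrative assumptions.

```python
def exponential_value(amount, delay, gamma=0.9):
    """Exponentially discounted value: amount * gamma**delay."""
    return amount * gamma ** delay


def hyperbolic_value(amount, delay, k=1.0):
    """Hyperbolically discounted value: amount / (1 + k*delay). k is an assumed rate."""
    return amount / (1.0 + k * delay)


def prefers_small_soon(value_fn, wait):
    """True if 50 after `wait` steps is valued above 100 after `wait + 5` steps."""
    return value_fn(50, wait) > value_fn(100, wait + 5)


if __name__ == "__main__":
    for wait in (0, 20):
        print(wait,
              prefers_small_soon(exponential_value, wait),
              prefers_small_soon(hyperbolic_value, wait))
```

Under these assumptions the exponential discounter makes the same choice at both waiting times, while the hyperbolic discounter takes the small reward when it is immediate and the large reward when both are far away. That preference reversal is the pattern usually invoked when hyperbolic discounting is used to explain procrastination.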
MDP Solutions
- In deterministic single-agent search, want an optimal sequence of actions from start to a goal
- In an MDP we want an optimal policy π(s)
  - A policy gives an action for each state
  - Optimal policy maximizes expected utility (i.e. expected rewards) if followed

[Figure: optimal policy when R(s, a, s’) = -0.04 for all non-terminals s]
Example Optimal Policies
[Figure: four grid-world panels showing the optimal policy for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03 and R(s) = -0.01.]
Utility of a State
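- Define the utility of a state under a policy: V^π(s) = expected total (discounted) rewards starting in s and following π
- Recursive definition (one-step look-ahead):

$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s')\,[\,R(s, \pi(s), s') + \gamma\, V^{\pi}(s')\,]$

- Also called policy evaluation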
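The one-step look-ahead can be turned into an iterative procedure by applying it repeatedly until the values stop changing. This is a rough sketch under stated assumptions: the dictionary layout for the model, the function name `evaluate_policy` and the three-state chain are hypothetical and only serve to make the recursion concrete.

```python
def evaluate_policy(states, policy, transitions, gamma=0.9, tol=1e-9):
    """Iterate V(s) <- sum_s' T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s')).

    transitions[(s, a)] is a list of (probability, next_state, reward) triples.
    States without a policy entry are treated as terminal, with value 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s not in policy:
                continue  # terminal state: value stays 0
            new_v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in transitions[(s, policy[s])])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


if __name__ == "__main__":
    # Hypothetical chain A -> B -> goal; the last step succeeds with probability 0.8.
    states = ["A", "B", "goal"]
    policy = {"A": "right", "B": "right"}
    transitions = {
        ("A", "right"): [(1.0, "B", 0.0)],
        ("B", "right"): [(0.8, "goal", 100.0), (0.2, "B", 0.0)],
    }
    print(evaluate_policy(states, policy, transitions))
```

For this toy chain the fixed point can be checked by hand: V(B) = 80 + 0.18·V(B), so V(B) = 80 / 0.82, and V(A) = 0.9·V(B).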
Bellman’s Equation for Selecting actions
- Definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over first action and then follow optimal policy
- Formally: Bellman’s Equation

$V^{*}(s) = \max_{a} \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma\, V^{*}(s')\,]$
That’s my equation!
Q-values
- The expected utility of taking a particular action a in a particular state s (Q-value of the pair (s,a))

[Figure: the grid world shown three times: r(state, action) immediate reward values (100 on two transitions, 0 elsewhere); Q(state, action) values (72, 81, 90 or 100 on each transition); V*(state) values (top row: 90, 100, 0 at G; bottom row: 81, 90, 100).]
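The Q-values in the figure can be recomputed from the V* values with a one-step look-ahead. The sketch below assumes a deterministic 2 × 3 grid with G in the top-right cell and a reward of 100 for entering G; that layout is inferred from the numbers on the slide and is an assumption, as are the names `q_values`, `MOVES` and `V_STAR`.

```python
GAMMA = 0.9
GOAL = (0, 2)  # assumed position of G: top-right cell of a 2 x 3 grid
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# V*(state) values read off the slide: top row 90, 100, G; bottom row 81, 90, 100.
V_STAR = {(0, 0): 90.0, (0, 1): 100.0, (0, 2): 0.0,
          (1, 0): 81.0, (1, 1): 90.0, (1, 2): 100.0}


def q_values(v=V_STAR, gamma=GAMMA):
    """Q(s, a) = r(s, a) + gamma * V*(s') for every legal move out of a non-goal cell."""
    q = {}
    for (row, col) in v:
        if (row, col) == GOAL:
            continue  # G is absorbing: no actions leave it
        for action, (dr, dc) in MOVES.items():
            nxt = (row + dr, col + dc)
            if nxt not in v:
                continue  # move would leave the grid
            reward = 100.0 if nxt == GOAL else 0.0
            q[((row, col), action)] = reward + gamma * v[nxt]
    return q


if __name__ == "__main__":
    for (state, action), value in sorted(q_values().items()):
        print(state, action, round(value, 1))
```

Under these assumptions the twelve Q-values come out as 72.9, 81, 90 and 100; the 72 on the slide is presumably 0.9 × 81 = 72.9 truncated. Taking the maximum Q-value in each state recovers the V* value of that state.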
Representation
- Explicit

| State | Action | Q(s, a) |
|-------|--------|---------|
| 2 | MoveLeft | 81 |
| 2 | MoveRight | 100 |
| ... | ... | ... |

- Implicit
  - Weighted linear function/neural network
  - Classical weight updating
A table of values for each action: Q-Functions
- A q-value is the value of a (state, action) pair under a policy
- Utility of starting in state s, taking action a, then following π thereafter
The Bellman Equations
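- Definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over first action and then follow optimal policy
- Formally:

$V^{*}(s) = \max_{a} Q^{*}(s, a)$

$Q^{*}(s, a) = \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma\, V^{*}(s')\,]$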
Optimal Utilities
Goal: calculate the optimal utility of each state
V*(s) = expected (discounted) rewards with optimal actions
Why: Given optimal utilities, MEU tells us the optimal policy
MDP solution methods
If we know T(s, a, s’) and R(s,a,s’), then we can solve the MDP to find the optimal policy in a number of ways.
- Dynamic programming
- Iterative estimation methods
  - Value iteration: assume 0 initial values for each state and update using the Bellman equation to pick actions.
  - Policy iteration: evaluate a given policy (find V(s) for the policy), then change it using Bellman updates till there is no improvement in the policy.
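Value iteration as described above can be sketched on the small grid world used earlier in the lecture. The grid layout (2 × 3, G in the top-right cell, reward 100 for entering G, deterministic moves) is an assumption inferred from the slide figures, and the helper names `successors`, `value_iteration` and `greedy_policy` are invented for this example.

```python
GAMMA = 0.9
GOAL = (0, 2)  # assumed: G is the top-right cell of a 2 x 3 grid
STATES = [(r, c) for r in range(2) for c in range(3)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def successors(state):
    """Yield (action, next_state, reward) for each legal deterministic move."""
    if state == GOAL:
        return  # absorbing goal: no moves
    for action, (dr, dc) in MOVES.items():
        nxt = (state[0] + dr, state[1] + dc)
        if nxt in STATES:
            yield action, nxt, (100.0 if nxt == GOAL else 0.0)


def value_iteration(gamma=GAMMA, tol=1e-9):
    """Start from V = 0 and apply V(s) <- max_a [r + gamma * V(s')] until stable."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            options = [r + gamma * V[s2] for _, s2, r in successors(s)]
            new_v = max(options) if options else 0.0
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def greedy_policy(V, gamma=GAMMA):
    """Pick, in each non-goal state, the action with the best one-step look-ahead."""
    return {s: max(successors(s), key=lambda t: t[2] + gamma * V[t[1]])[0]
            for s in STATES if s != GOAL}


if __name__ == "__main__":
    V = value_iteration()
    print(V)
    print(greedy_policy(V))
```

On this grid the values settle at 90, 100, 0 on the top row and 81, 90, 100 on the bottom row, the same numbers as the V* figure. Some states have two equally good actions; the sketch simply returns the first one it encounters.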
Reinforcement Learning
- Reinforcement learning: we have an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s’)
  - A reward function R(s,a,s’)
- We are looking for a policy π(s)
- We don’t know T or R
  - I.e. don’t know which states are good or what the actions do
  - Must actually try actions and states out to learn
Reinforcement Learning
Target function is π : state → action
However…
We have no training examples of form
<state, action>
Training examples are of form
<<state, action>, new-state, reward>
Passive Learning
- Simplified task
  - You don’t know the transitions T(s,a,s’)
  - You don’t know the rewards R(s,a,s’)
  - You are given a policy π(s)
  - Goal: learn the state values (and maybe the model)
- In this case:
  - No choice about what actions to take
  - Just execute the policy and learn from experience
Example: Direct Estimation (Simple Monte Carlo)

γ = 1, R = -1

[Figure: grid world with x and y axes and two exits, +100 and -100.]

Episodes:

Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)

Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)

U(1,1) ~ (92 + -106) / 2 = -7
U(3,3) ~ (99 + 97 + -102) / 3 = 31.3
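The two estimates on this slide can be reproduced by averaging the return that follows every visit to a state. This is a minimal sketch: the episodes are copied from the slide, while the function name `direct_estimate` and the tuple layout are choices made for this example.

```python
from collections import defaultdict

# The two episodes from the slide as (state, action, reward) steps.
EPISODE_1 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
             ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
             ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)]
EPISODE_2 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
             ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
             ((4, 2), "exit", -100)]


def direct_estimate(episodes, gamma=1.0):
    """Average the observed return following every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the return from each step to the end of the episode.
        for state, _action, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


if __name__ == "__main__":
    U = direct_estimate([EPISODE_1, EPISODE_2])
    print(U[(1, 1)], round(U[(3, 3)], 1))
```

State (3,3) is visited twice in the first episode (returns 97 and 99) and once in the second (return -102), which is why the slide averages three numbers for it.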
Full Estimation (Dynamic Programming)
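$V(s_t) \leftarrow E_{\pi}\{\, r_{t+1} + \gamma\, V(s_{t+1}) \,\}$

[Figure: backup diagram rooted at s_t that branches over every possible reward r_{t+1} and successor state s_{t+1}; T marks terminal states.]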
Simple Monte Carlo
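$V(s_t) \leftarrow V(s_t) + \alpha\,[\,R_t - V(s_t)\,]$, where $R_t$ is the actual return following state $s_t$.

[Figure: backup diagram rooted at s_t that follows one sampled path all the way to a terminal state T.]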
Combining DP and MC
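$V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)\,]$

[Figure: backup diagram rooted at s_t that follows one sampled step, through r_{t+1} to s_{t+1}; T marks terminal states.]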
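The update that combines the DP and Monte Carlo ideas uses one sampled step and the current estimate of the successor's value. A minimal sketch follows, assuming a dictionary for V with unseen states defaulting to 0; the function name `td0_update`, the step size 0.5 and the short experience list are illustrative assumptions.

```python
def td0_update(V, s, reward, s_next, alpha=0.5, gamma=1.0):
    """Apply V(s) <- V(s) + alpha * (reward + gamma * V(s_next) - V(s)) in place."""
    target = reward + gamma * V.get(s_next, 0.0)
    old = V.get(s, 0.0)
    V[s] = old + alpha * (target - old)
    return V[s]


if __name__ == "__main__":
    V = {}
    # Hypothetical experience: (state, reward, next state); "end" stands for termination.
    experience = [((3, 3), -1, (4, 3)), ((4, 3), 100, "end"),
                  ((3, 3), -1, (4, 3))]
    for s, r, s_next in experience:
        print(s, td0_update(V, s, r, s_next))
```

The first update to (3,3) only sees the step reward, because the successor's value is still 0. After (4,3) has been updated, the second visit to (3,3) pulls its value toward the successor's estimate, which is the bootstrapping that distinguishes this update from the Monte Carlo one.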
Reinforcement Learning
Target function is π : state → action
However…
We have no training examples of form
<state, action>
Training examples are of form
<<state, action>, new-state, reward>
Model-Free Learning
- Big idea: why bother learning T?
  - Update each time we experience a transition
  - Frequent outcomes will contribute more updates (over time)
- Temporal difference learning (TD)
  - Policy still fixed!
  - Move values toward value of whatever successor occurs

[Figure: one-step backup tree: state s, action a, q-state (s, a), transition (s, a, s’), successor s’.]
Q-Learning
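- Learn Q*(s,a) values
  - Receive a sample (s,a,s’,r)
  - Consider your old estimate: $Q(s, a)$
  - Consider your new sample estimate: $\text{sample} = r + \gamma \max_{a'} Q(s', a')$
  - Nudge the old estimate towards the new sample: $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \cdot \text{sample}$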
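The three steps (old estimate, new sample, nudge) fit in a few lines. This is a sketch, assuming a dictionary keyed by (state, action) with missing entries read as 0; the function name `q_update` and the state labels in the usage lines are hypothetical.

```python
def q_update(Q, s, a, s_next, reward, actions, alpha=0.5, gamma=0.9):
    """Nudge Q(s, a) toward the sample reward + gamma * max_a' Q(s_next, a')."""
    old = Q.get((s, a), 0.0)                       # old estimate
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    sample = reward + gamma * best_next            # new sample estimate
    Q[(s, a)] = (1 - alpha) * old + alpha * sample # move part of the way
    return Q[(s, a)]


if __name__ == "__main__":
    Q = {}
    actions = ["left", "right"]
    # Hypothetical samples (s, a, s', r): stepping into goal "G", then a step toward it.
    print(q_update(Q, "s2", "right", "G", 100, actions))
    print(q_update(Q, "s1", "right", "s2", 0, actions))
```

Because no update is ever made for actions taken from the goal state, its Q-values stay at 0, so the sample for a step into the goal is just the reward.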
Any problems with this?
- What if the starting policy doesn’t let you explore the state space?
  - T(s,a,s’) is unknown and never estimated.
  - The value of unexplored states is never computed.
- How do we address this problem?
  - Fundamental problem in RL and in Biology
  - AI solutions include
    - ε-greedy
    - Softmax
  - Evidence from Neuroscience (next lecture).
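Softmax action selection is named on the slide without a formula. The sketch below assumes the usual Boltzmann form, with probability proportional to exp(Q / temperature); the `temperature` parameter and both function names are assumptions introduced for illustration.

```python
import math
import random


def softmax_probabilities(q_values, temperature=1.0):
    """Map {action: Q} to {action: probability} proportional to exp(Q / temperature)."""
    top = max(q_values.values())
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow.
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample one action according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


if __name__ == "__main__":
    q = {"MoveLeft": 81.0, "MoveRight": 100.0}
    print(softmax_probabilities(q, temperature=10.0))
    print(softmax_action(q, temperature=10.0))
```

Unlike ε-greedy, this scheme explores worse-looking actions in proportion to how close their Q-values are to the best one. A high temperature approaches uniform random choice and a low temperature approaches greedy choice.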
Exploration / Exploitation
- Several schemes for forcing exploration
  - Simplest: random actions (ε-greedy)
    - Every time step, flip a coin
    - With probability ε, act randomly
    - With probability 1-ε, act according to current policy (best q value for instance)
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
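The coin flip and one way of lowering ε over time can be sketched as follows. The Q-table layout and the function names are assumptions, and the decay schedule (start / (1 + step) with a floor) is just one possible choice, not one prescribed by the lecture.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise take the best-Q action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def decayed_epsilon(step, start=1.0, floor=0.01):
    """One possible schedule (an assumption): start / (1 + step), never below floor."""
    return max(floor, start / (1 + step))


if __name__ == "__main__":
    Q = {("s", "left"): 81.0, ("s", "right"): 100.0}
    rng = random.Random(0)
    for step in (0, 9, 999):
        eps = decayed_epsilon(step)
        print(step, eps, epsilon_greedy(Q, "s", ["left", "right"], eps, rng))
```

Early on ε is close to 1 and the agent acts almost entirely at random; as ε shrinks, the agent settles on the greedy action and stops thrashing around, while the floor keeps a small amount of exploration alive.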
Q-Learning
Q Learning features
- On-line, Incremental
- Bootstrapping (like DP unlike MC)
- Model free
- Converges to an optimal policy.
  - On average when alpha is small
  - With probability 1 when alpha is high in the beginning and low at the end (say 1/k)
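These features can be seen together in a small end-to-end run. The sketch below is one possible arrangement, not the lecture's demo: it assumes the deterministic 2 × 3 grid from the earlier figures (G in the top-right cell, reward 100 for entering G), uses ε-greedy exploration with a fixed ε, and sets alpha = 1/k where k counts the visits to each (state, action) pair. All names and parameter values are illustrative.

```python
import random
from collections import defaultdict

GAMMA = 0.9
GOAL = (0, 2)  # assumed: G is the top-right cell of a 2 x 3 grid
STATES = [(r, c) for r in range(2) for c in range(3)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def legal_actions(state):
    """Actions that keep the agent on the grid; none are available in the goal."""
    if state == GOAL:
        return []
    return [a for a, (dr, dc) in MOVES.items()
            if (state[0] + dr, state[1] + dc) in STATES]


def step(state, action):
    """Deterministic environment: return (next_state, reward)."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt, (100.0 if nxt == GOAL else 0.0)


def q_learning(episodes=2000, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and alpha = 1/k per (s, a)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    visits = defaultdict(int)
    starts = [s for s in STATES if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(starts)
        while s != GOAL:
            acts = legal_actions(s)
            if rng.random() < epsilon:
                a = rng.choice(acts)                    # explore
            else:
                a = max(acts, key=lambda x: Q[(s, x)])  # exploit
            s2, r = step(s, a)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]                # high at first, low later
            best_next = max((Q[(s2, x)] for x in legal_actions(s2)), default=0.0)
            Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])  # on-line update
            s = s2
    return Q


if __name__ == "__main__":
    Q = q_learning()
    for s in STATES:
        if s != GOAL:
            print(s, max(legal_actions(s), key=lambda a: Q[(s, a)]))
```

The update happens after every single step (on-line, incremental), uses the current estimate of the next state (bootstrapping) and never builds T or R (model free). With alpha = 1/k each Q-value is the running average of its samples, so early low samples are averaged away only gradually.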
Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent’s utility is defined by the reward function
  - Must learn to act so as to maximize expected utility
  - Change the rewards, change the behavior
- Examples:
  - Learning your way around, reward for reaching the destination
  - Playing a game, reward at the end for winning / losing
  - Vacuuming a house, reward for each piece of dirt picked up
  - Automated taxi, reward for each passenger delivered
DEMO
Demo of Q Learning
- Demo arm-control
- Parameters
  - α (learning rate)
  - γ (discounted reward; high for future rewards)
  - ε (exploration; should decrease with time)
- MDP
  - Reward = number of pixels moved to the right / iteration number
  - Actions: arm up and down (yellow line), hand up and down (red line)