TRANSCRIPT
Introduction to Reinforcement Learning
Dr Kathryn Merrick
[email protected]
2008 Spring School on Optimisation, Learning and Complexity
Friday 7th November, 15:30-17:00
Reinforcement Learning is…
… learning from trial-and-error and reward by interaction with
an environment.
Today’s Lecture
• A formal framework: Markov Decision Processes
• Optimality criteria
• Value functions
• Solution methods: Q-learning
• Examples and exercises
• Alternative models
• Summary and applications
Markov Decision Processes
The reinforcement learning problem can be represented as:
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S × A → S (deterministic)
or T: S × A × S → [0, 1] (stochastic)
• A reward function R: S × A → Real or R: S × A × S → Real
• A policy π: S → A (deterministic)
or π: S × A → [0, 1] (stochastic)
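As a concrete illustration, these elements might be represented as follows for a small deterministic MDP. This is a minimal Python sketch; the states, actions, rewards and policy are invented for illustration, not taken from the lecture.

```python
# A minimal sketch of a finite, deterministic MDP (all values illustrative).
states = ["s1", "s2"]
actions = ["a1", "a2"]

# Deterministic transition function T: S x A -> S
T = {("s1", "a1"): "s2",
     ("s1", "a2"): "s1",
     ("s2", "a1"): "s1",
     ("s2", "a2"): "s2"}

# Reward function R: S x A -> Real
R = {("s1", "a1"): 1.0,
     ("s1", "a2"): 0.0,
     ("s2", "a1"): 0.0,
     ("s2", "a2"): 0.0}

# Deterministic policy pi: S -> A
policy = {"s1": "a1", "s2": "a1"}
```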
Optimality Criteria
Suppose an agent receives a reward rt at time t. Then optimal behaviour might:
• Maximise the sum of expected future reward: E{ Σt rt }
• Maximise over a finite horizon: E{ Σ t=0..T rt }
• Maximise over an infinite horizon: E{ Σ t=0..∞ rt }
• Maximise over a discounted infinite horizon: E{ Σ t=0..∞ γ^t rt }, where γ is the discount factor, 0 ≤ γ < 1
• Maximise average reward: lim n→∞ (1/n) E{ Σ t=0..n rt }
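For instance, the discounted criterion can be approximated on a truncated reward sequence. A minimal sketch; the reward values are illustrative:

```python
# Discounted return for a (truncated) reward sequence: sum of gamma^t * r_t.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 at every step converges towards 1 / (1 - gamma) = 10.
print(discounted_return([1.0] * 100, gamma=0.9))  # approx. 9.9997
```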
Value Functions
State value function
Vπ: S → Real, written Vπ(s)
The expected sum of discounted reward for following the policy π from state s to the end of time.
State-action value function
Qπ: S × A → Real, written Qπ(s, a)
The expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s’ to the end of time.
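Written out using the discounted criterion from the previous slide, these definitions read:
Vπ(s) = E{ Σ t=0..∞ γ^t rt | s0 = s, π }
Qπ(s, a) = E{ Σ t=0..∞ γ^t rt | s0 = s, a0 = a, π }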
Optimal State Value Function
V*(s) = max_a E{ R(s, a, s’) + γ V*(s’) | s, a }
      = max_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
• A Bellman Equation
• Can be solved using dynamic programming
• Requires knowledge of the transition function T
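As a sketch, the dynamic-programming solution (value iteration) repeatedly applies this update until the values stop changing. The data layout here is an assumption: T[(s, a)] is a list of (next state, probability) pairs and R[(s, a, s2)] is the reward.

```python
# Value iteration: a dynamic-programming sketch for solving the Bellman equation.
# Assumed layout: T[(s, a)] = list of (s_next, probability), R[(s, a, s_next)] = reward.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                 # initial V*(s) estimates
    while True:
        delta = 0.0
        for s in states:
            # V(s) <- max_a sum_s' T(s, a, s') [ R(s, a, s') + gamma V(s') ]
            v = max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in T[(s, a)])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:                          # stop once values have converged
            return V
```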
Optimal State-Action Value Function
Q*(s, a) = E{ R(s, a, s’) + γ max_a’ Q*(s’, a’) | s, a }
         = Σs’ T(s, a, s’) [ R(s, a, s’) + γ max_a’ Q*(s’, a’) ]
• Also a Bellman Equation
• Also requires knowledge of the transition function T to solve using dynamic programming
• Can now define action selection:
π*(s) = argmax_a Q*(s, a)
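Given Q*, the greedy policy can be read straight off the table. A minimal sketch, assuming Q is a dictionary keyed by (state, action):

```python
# pi*(s) = argmax over a of Q*(s, a), with Q keyed by (state, action) pairs.
def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```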
Solution Methods
• Model based:
– For example dynamic programming
– Require a model (transition function) of the environment for learning
• Model free:
– Learn from interaction with the environment without requiring a model
– For example Q-learning…
Q-Learning by Example: Driving in Canberra
[Transition diagram: four states (Parked Clean, Parked Dirty, Driving Clean, Driving Dirty) connected by arcs labelled with the actions Drive, Clean and Park.]
Formulating the Problem
• States:
s1 Park clean
s2 Park dirty
s3 Drive clean
s4 Drive dirty
• Actions:
a1 Drive
a2 Clean
a3 Park
• Reward:
rt = 1 for transitions to a ‘clean’ state, 0 otherwise
• State-Action Table or Q-Table:

     a1   a2   a3
s1   ?    ?    ?
s2   ?    ?    ?
s3   ?    ?    ?
s4   ?    ?    ?
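A sketch of this formulation in Python. The Q-Table starts as all zeros (the ‘?’ entries above); the transition table is one plausible reading of the diagram, so treat it as an assumption rather than the lecture’s exact dynamics.

```python
# Q-Table initialised to zero (the '?' entries above).
states  = ["s1", "s2", "s3", "s4"]   # Park clean, Park dirty, Drive clean, Drive dirty
actions = ["a1", "a2", "a3"]         # Drive, Clean, Park
Q = {(s, a): 0.0 for s in states for a in actions}

# One plausible reading of the transition diagram (an assumption):
# driving dirties the car, cleaning restores it, parking preserves cleanliness.
T = {("s1", "a1"): "s4", ("s1", "a2"): "s1", ("s1", "a3"): "s1",
     ("s2", "a1"): "s4", ("s2", "a2"): "s1", ("s2", "a3"): "s2",
     ("s3", "a1"): "s4", ("s3", "a2"): "s3", ("s3", "a3"): "s1",
     ("s4", "a1"): "s4", ("s4", "a2"): "s3", ("s4", "a3"): "s2"}

# Reward: 1 for transitions into a 'clean' state (s1 or s3), 0 otherwise.
def reward(s_next):
    return 1.0 if s_next in ("s1", "s3") else 0.0
```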
Q-Learning Algorithmic Components
• Learning update (to Q-Table):
Q(s, a) ← (1-α) Q(s, a) + α [ r + γ max_a’ Q(s’, a’) ]
or
Q(s, a) ← Q(s, a) + α [ r + γ max_a’ Q(s’, a’) - Q(s, a) ]
• Action selection (from Q-Table):
a = f(Q(s, a)), for example the greedy choice a = argmax_a Q(s, a)
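Putting the update and a selection function f together, a sketch that reuses the Q, T, reward and actions structures from the formulation above. Epsilon-greedy is one common choice of f, not necessarily the one used in the lecture:

```python
import random

# One training run of tabular Q-learning over the Canberra sketch above.
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration

def select_action(s):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

s = "s2"                                    # start parked and dirty
for _ in range(100):
    a = select_action(s)
    s_next = T[(s, a)]                      # deterministic transition
    r = reward(s_next)
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next
```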
Exercise
You need to program a small robot to learn to find food.
• What assumptions will you make about the robot’s sensors and
actuators to represent the environment?
• How could you model the problem as an MDP?
• Calculate a few learning iterations in your domain by hand.
Alternatives
• Function approximation of the Q-Table:
– Neural networks
– Decision trees
– Gradient descent methods
• Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning
References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning: an Introduction, The MIT Press
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996) Reinforcement Learning: a Survey, Journal of Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77