TRANSCRIPT
Introduction to Reinforcement Learning
Dr Kathryn Merrick
[email protected]
2008 Spring School on Optimisation, Learning and Complexity
Friday 7th November, 15:30-17:00
Reinforcement Learning is…
… learning from trial-and-error and reward by interaction with
an environment.
Today’s Lecture
• A formal framework: Markov Decision Processes
• Optimality criteria
• Value functions
• Solution methods: Q-learning
• Examples and exercises
• Alternative models
• Summary and applications
Markov Decision Processes
The reinforcement learning problem can be represented as:
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S × A → S (deterministic)
or T: S × A × S → [0, 1] (stochastic)
• A reward function R: S × A → Real or R: S × A × S → Real
• A policy π: S → A (deterministic)
or π: S × A → [0, 1] (stochastic)
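As a concrete illustration, these elements might be represented as follows for a small deterministic MDP. This is a minimal Python sketch; the states, actions, rewards and policy are invented for illustration, not taken from the lecture.

```python
# A minimal sketch of a finite, deterministic MDP (all values illustrative).
states = ["s1", "s2"]
actions = ["a1", "a2"]

# Deterministic transition function T: S x A -> S
T = {("s1", "a1"): "s2",
     ("s1", "a2"): "s1",
     ("s2", "a1"): "s1",
     ("s2", "a2"): "s2"}

# Reward function R: S x A -> Real
R = {("s1", "a1"): 1.0,
     ("s1", "a2"): 0.0,
     ("s2", "a1"): 0.0,
     ("s2", "a2"): 0.0}

# Deterministic policy pi: S -> A
policy = {"s1": "a1", "s2": "a1"}
```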
Optimality Criteria
Suppose an agent receives a reward rt at time t. Then optimal behaviour might:
• Maximise the sum of expected future reward: E{ Σt rt }
• Maximise over a finite horizon: E{ Σ t=0..T rt }
• Maximise over an infinite horizon: E{ Σ t=0..∞ rt }
• Maximise over a discounted infinite horizon: E{ Σ t=0..∞ γ^t rt }, where γ is the discount factor, 0 ≤ γ < 1
• Maximise average reward: lim n→∞ (1/n) E{ Σ t=0..n rt }
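For instance, the discounted criterion can be approximated on a truncated reward sequence. A minimal sketch; the reward values are illustrative:

```python
# Discounted return for a (truncated) reward sequence: sum of gamma^t * r_t.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 at every step converges towards 1 / (1 - gamma) = 10.
print(discounted_return([1.0] * 100, gamma=0.9))  # approx. 9.9997
```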
Value Functions
State value function
Vπ: S → Real, written Vπ(s)
The expected sum of discounted reward for following the policy π from state s to the end of time.
State-action value function
Qπ: S × A → Real, written Qπ(s, a)
The expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s’ to the end of time.
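Written out using the discounted criterion from the previous slide, these definitions read:
Vπ(s) = E{ Σ t=0..∞ γ^t rt | s0 = s, π }
Qπ(s, a) = E{ Σ t=0..∞ γ^t rt | s0 = s, a0 = a, π }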
Optimal State Value Function
V*(s) = max_a E{ R(s, a, s’) + γ V*(s’) | s, a }
      = max_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
• A Bellman Equation
• Can be solved using dynamic programming
• Requires knowledge of the transition function T
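As a sketch, the dynamic-programming solution (value iteration) repeatedly applies this update until the values stop changing. The data layout here is an assumption: T[(s, a)] is a list of (next state, probability) pairs and R[(s, a, s2)] is the reward.

```python
# Value iteration: a dynamic-programming sketch for solving the Bellman equation.
# Assumed layout: T[(s, a)] = list of (s_next, probability), R[(s, a, s_next)] = reward.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                 # initial V*(s) estimates
    while True:
        delta = 0.0
        for s in states:
            # V(s) <- max_a sum_s' T(s, a, s') [ R(s, a, s') + gamma V(s') ]
            v = max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in T[(s, a)])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:                          # stop once values have converged
            return V
```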
Optimal State-Action Value Function
Q*(s, a) = E{ R(s, a, s’) + γ max_a’ Q*(s’, a’) | s, a }
         = Σs’ T(s, a, s’) [ R(s, a, s’) + γ max_a’ Q*(s’, a’) ]
• Also a Bellman Equation
• Also requires knowledge of the transition function T to solve using dynamic programming
• Can now define action selection:
π*(s) = argmax_a Q*(s, a)
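Given Q*, the greedy policy can be read straight off the table. A minimal sketch, assuming Q is a dictionary keyed by (state, action):

```python
# pi*(s) = argmax over a of Q*(s, a), with Q keyed by (state, action) pairs.
def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```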
Solution Methods
• Model based:
– For example dynamic programming
– Require a model (transition function) of the environment for learning
• Model free:
– Learn from interaction with the environment without requiring a model
– For example Q-learning…
Q-Learning by Example: Driving in Canberra
[Transition diagram: four states (Parked Clean, Parked Dirty, Driving Clean, Driving Dirty) connected by arcs labelled with the actions Drive, Clean and Park.]
Formulating the Problem
• States:
s1 Park clean
s2 Park dirty
s3 Drive clean
s4 Drive dirty
• Actions:
a1 Drive
a2 Clean
a3 Park
• Reward:
rt = 1 for transitions to a ‘clean’ state, 0 otherwise
• State-Action Table or Q-Table:

     a1   a2   a3
s1   ?    ?    ?
s2   ?    ?    ?
s3   ?    ?    ?
s4   ?    ?    ?
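A sketch of this formulation in Python. The Q-Table starts as all zeros (the ‘?’ entries above); the transition table is one plausible reading of the diagram, so treat it as an assumption rather than the lecture’s exact dynamics.

```python
# Q-Table initialised to zero (the '?' entries above).
states  = ["s1", "s2", "s3", "s4"]   # Park clean, Park dirty, Drive clean, Drive dirty
actions = ["a1", "a2", "a3"]         # Drive, Clean, Park
Q = {(s, a): 0.0 for s in states for a in actions}

# One plausible reading of the transition diagram (an assumption):
# driving dirties the car, cleaning restores it, parking preserves cleanliness.
T = {("s1", "a1"): "s4", ("s1", "a2"): "s1", ("s1", "a3"): "s1",
     ("s2", "a1"): "s4", ("s2", "a2"): "s1", ("s2", "a3"): "s2",
     ("s3", "a1"): "s4", ("s3", "a2"): "s3", ("s3", "a3"): "s1",
     ("s4", "a1"): "s4", ("s4", "a2"): "s3", ("s4", "a3"): "s2"}

# Reward: 1 for transitions into a 'clean' state (s1 or s3), 0 otherwise.
def reward(s_next):
    return 1.0 if s_next in ("s1", "s3") else 0.0
```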
Q-Learning Algorithmic Components
• Learning update (to Q-Table):
Q(s, a) ← (1-α) Q(s, a) + α [ r + γ max_a’ Q(s’, a’) ]
or
Q(s, a) ← Q(s, a) + α [ r + γ max_a’ Q(s’, a’) - Q(s, a) ]
• Action selection (from Q-Table):
a = f(Q(s, a)), for example the greedy choice a = argmax_a Q(s, a)
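Putting the update and a selection function f together, a sketch that reuses the Q, T, reward and actions structures from the formulation above. Epsilon-greedy is one common choice of f, not necessarily the one used in the lecture:

```python
import random

# One training run of tabular Q-learning over the Canberra sketch above.
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration

def select_action(s):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

s = "s2"                                    # start parked and dirty
for _ in range(100):
    a = select_action(s)
    s_next = T[(s, a)]                      # deterministic transition
    r = reward(s_next)
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next
```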
Exercise
You need to program a small robot to learn to find food.
• What assumptions will you make about the robot’s sensors and
actuators to represent the environment?
• How could you model the problem as an MDP?
• Calculate a few learning iterations in your domain by hand.
Alternatives
• Function approximation of the Q-Table:
– Neural networks
– Decision trees
– Gradient descent methods
• Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning
References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning: an Introduction, The MIT Press
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996) Reinforcement Learning: a Survey, Journal of Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77