Machine Learning
Reinforcement Learning
Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Institut für Informatik, Technische Fakultät
Albert-Ludwigs-Universität Freiburg
riedmiller@informatik.uni-freiburg.de
Motivation

Can a software agent learn to play Backgammon by itself?

Learning from success or failure.

Neuro-Backgammon: playing at world-champion level (Tesauro, 1992).
Can a software agent learn to balance a pole by itself?

Learning from success or failure.

Neural RL controllers: noisy, unknown, nonlinear systems (Riedmiller et al.).
Can a software agent learn to cooperate with others by itself?

Learning from success or failure.

Cooperative RL agents: complex, multi-agent, cooperative domains (Riedmiller et al.).
Reinforcement Learning

has biological roots: reward and punishment

no teacher, but: actions + goal → (learning) → algorithm / policy

[Figure: an agent chooses among Action 1, Action 2, ... to reach a Goal ('Happy Programming')]
Actor-Critic Scheme (Barto, Sutton, Anderson, 1983)

[Figure: the Actor acts on the World; the Critic receives the external reward and turns it into an internal reinforcement signal for the Actor]

Actor-Critic Scheme:
• the Critic maps the external, delayed reward into an internal training signal
• the Actor represents the policy
Overview

I. Reinforcement Learning - Basics
A First Example

[Figure: maze with a Goal cell]

Repeat
    Choose action a ∈ {→, ←, ↑}
Until Goal is reached
The 'Temporal Credit Assignment' Problem

Which action(s) in the sequence have to be changed?

⇒ Temporal Credit Assignment Problem
Sequential Decision Making

Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), Cart-Pole Balancing (AHC/ACE (Barto, Sutton, Anderson, 1983)), robotics and control, ...
Three Steps

⇒ Describe the environment as a Markov Decision Process (MDP)
⇒ Formulate the learning task as a dynamic optimization problem
⇒ Solve the dynamic optimization problem by dynamic programming methods
1. Description of the environment

S: (finite) set of states
A: (finite) set of actions

Behaviour of the environment ('model'):
p : S × S × A → [0, 1], where p(s′, s, a) is the probability of the transition from s to s′ under action a.

For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function

f : S × A → S, s′ = f(s, a)

'Markov' property: the transition depends only on the current state and action:

Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, ...)
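To make these ingredients concrete, here is a minimal Python sketch of a deterministic environment in the spirit of the maze above. The 4×4 grid, the GOAL cell, and the restriction to the actions {→, ←, ↑} are illustrative assumptions, not the lecture's actual maze.

# Hypothetical 4x4 grid world; GOAL and the wall behaviour are assumptions.
GOAL = (3, 3)

# A: (finite) set of actions, a in {right, left, up}, as (row, col) offsets
ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0)}

# S: (finite) set of states
STATES = [(r, c) for r in range(4) for c in range(4)]

def f(s, a):
    """Deterministic model f : S x A -> S; bumping into a wall keeps s."""
    r, c = s
    dr, dc = ACTIONS[a]
    r2, c2 = r + dr, c + dc
    return (r2, c2) if 0 <= r2 < 4 and 0 <= c2 < 4 else s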
2. Formulation of the learning task

Every transition emits transition costs ('immediate costs'), c : S × A → ℝ (sometimes also called 'immediate reward', r).

Now, an agent policy π : S → A can be evaluated (and judged). Consider the path costs:

J^π(s) = Σ_t c(s_t, π(s_t)), with s_0 = s

Wanted: an optimal policy π* : S → A, where

J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }

[Figure: maze with immediate costs (1 and 2) attached to individual transitions]

⇒ Additive (path) costs allow all events along the way to be taken into account
⇒ Does this solve the temporal credit assignment problem? YES!
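As a sketch of how a policy could be judged by its path costs under the deterministic grid world assumed above, the following hypothetical helper simply rolls out π from s and accumulates the immediate costs; the step cap only keeps the sketch finite.

def path_costs(pi, s, c, f, max_steps=1000):
    """J^pi(s) = sum_t c(s_t, pi(s_t)) with s_0 = s, stopping at GOAL."""
    J = 0.0
    for _ in range(max_steps):
        if s == GOAL:          # absorbing terminal state, zero further costs
            return J
        a = pi(s)
        J += c(s, a)
        s = f(s, a)
    return float("inf")        # treat non-terminating rollouts as improper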
The choice of the immediate cost function c(·) specifies the policy to be learned.

Example:

c(s) = 0, if s is a success (s ∈ Goal)
       1000, if s is a failure (s ∈ Failure)
       1, else

[Figure: two paths from the start state; the successful path has J^π(s_start) = 12, the failing path J^π(s_start) = 1004]

⇒ Specification of the requested policy via c(·) is simple!
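The example cost function translates directly into code; the FAILURE set below is a hypothetical placeholder for the failure cells of the maze, and c takes the action as an (unused) argument to match the signature c : S × A → ℝ.

FAILURE = {(2, 0)}  # assumed failure cell(s); purely illustrative

def c(s, a):
    """Immediate costs c : S x A -> R; here they depend only on s."""
    if s == GOAL:
        return 0.0       # success
    if s in FAILURE:
        return 1000.0    # failure
    return 1.0           # every other step costs 1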
3. Solving the optimization problem

For the optimal path costs it is known that

J*(s) = min_a { c(s, a) + J*(f(s, a)) }

(Principle of Optimality (Bellman, 1957))

⇒ Can we compute J*? (We will see soon why this helps.)
Computing J*: the value iteration (VI) algorithm

Start with arbitrary J_0(s).

For all states s: J_{k+1}(s) := min_{a∈A} { c(s, a) + J_k(f(s, a)) }

[Figure: one update in a grid; a cell with unknown value and successors valued 5, 7, 2 (transition cost 1 each) receives the value 3 = 1 + min{5, 7, 2}]
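A minimal tabular value iteration in Python, assuming the deterministic grid world and cost function sketched earlier; it performs synchronous sweeps over all states, with J pinned to 0 at the terminal GOAL state.

def value_iteration(states, actions, c, f, sweeps=200):
    """Iterate J_{k+1}(s) := min_a { c(s, a) + J_k(f(s, a)) }."""
    J = {s: 0.0 for s in states}               # arbitrary J_0
    for _ in range(sweeps):
        J = {s: 0.0 if s == GOAL else
                min(c(s, a) + J[f(s, a)] for a in actions)
             for s in states}
    return J

J_star = value_iteration(STATES, ACTIONS, c, f)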
Convergence of value iteration

Value iteration converges under certain assumptions, i.e. we have

lim_{k→∞} J_k = J*

⇒ Discounted problems: J^{π*}(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s }, where 0 ≤ γ < 1 (contraction mapping)

⇒ Stochastic shortest path problems:
• there exists an absorbing terminal state with zero costs
• there exists a 'proper' policy (a policy that has a non-zero chance to finally reach the terminal state)
• every non-proper policy has infinite path costs for at least one state
Ok, now we have J*

⇒ When J* is known, then we also know an optimal policy:

π*(s) ∈ arg min_{a∈A} { c(s, a) + J*(f(s, a)) }

[Figure: a grid cell with successor values 5, 7, 2 and transition costs 1; the greedy action moves to the successor with the smallest value]
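Extracting a greedy policy from J* is then one line per state. This sketch again assumes the deterministic model f and the helpers defined above.

def greedy_policy(J, actions, c, f):
    """pi*(s) in arg min_a { c(s, a) + J*(f(s, a)) }."""
    return lambda s: min(actions, key=lambda a: c(s, a) + J[f(s, a)])

pi_star = greedy_policy(J_star, ACTIONS, c, f)
print(path_costs(pi_star, (0, 0), c, f))   # optimal path costs from the start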
Back to our maze

[Figure: the maze with the optimal costs-to-go J* entered in every cell; values count down from about 11 along the path from Start to the Goal, while cells near the failure states carry values around 1000 and 1001]
Overview of the approach so far

• Description of the learning task as an MDP (S, A, T, f, c);
  c specifies the requested behaviour / policy
• Iterative computation of the optimal path costs J*:
  ∀s ∈ S : J_{k+1}(s) = min_{a∈A} { c(s, a) + J_k(f(s, a)) }
• Computation of an optimal policy from J*:
  π*(s) ∈ arg min_{a∈A} { c(s, a) + J*(f(s, a)) }
• The value function ('costs-to-go') can be stored in a table
Overview of the approach: Stochastic Domains

• Value iteration in stochastic environments:
  ∀s ∈ S : J_{k+1}(s) = min_{a∈A} { Σ_{s′∈S} p(s′, s, a) (c(s, a) + J_k(s′)) }
• Computation of an optimal policy from J*:
  π*(s) ∈ arg min_{a∈A} { Σ_{s′∈S} p(s′, s, a) (c(s, a) + J*(s′)) }
• The value function J ('costs-to-go') can be stored in a table
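For the stochastic case, the backup averages over successor states. A sketch, assuming the transition kernel is given as a hypothetical Python function p(s2, s, a) returning the probability of reaching s2 from s under action a:

def stochastic_value_iteration(states, actions, c, p, sweeps=200):
    """J_{k+1}(s) := min_a sum_{s'} p(s', s, a) * (c(s, a) + J_k(s'))."""
    J = {s: 0.0 for s in states}
    for _ in range(sweeps):
        J = {s: 0.0 if s == GOAL else
                min(sum(p(s2, s, a) * (c(s, a) + J[s2]) for s2 in states)
                    for a in actions)
             for s in states}
    return J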
Reinforcement Learning

Problems of Value Iteration:

for all s ∈ S : J_{k+1}(s) = min_{a∈A} { c(s, a) + J_k(f(s, a)) }

Problems:
• the size of S (chess, robotics, ...) ⇒ learning time, storage?
• the 'model' (transition behaviour) f(s, a) or p(s′, s, a) must be known!

Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.
Important contributions - Overview

• Real-Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
• Model-free learning (Q-Learning (Watkins, 1989))
• Neural representation of the value function (or alternative function approximators)
Real-Time Dynamic Programming (Barto, Sutton, Watkins, 1989)

Idea: instead of "for all s ∈ S", now "for some s ∈ S" ...
⇒ learning based on trajectories (experiences)
Q-Learning

Idea (Watkins, PhD thesis, 1989):

In every state, store for every action the expected costs-to-go. Q^π(s, a) denotes the expected future path costs for applying action a in state s (and continuing according to policy π):

Q^π(s, a) := Σ_{s′∈S} p(s′, s, a) (c(s, a) + J^π(s′))

where J^π(s′) are the expected path costs when starting from s′ and acting according to π.
Q-learning: Action selection

is now possible without a model:

Original VI: state evaluation. Action selection:
π*(s) ∈ arg min_{a∈A} { c(s, a) + J*(f(s, a)) }

Q: state-action evaluation. Action selection:
π*(s) ∈ arg min_{a∈A} Q*(s, a)

[Figure: left, a grid of state values, where picking an action requires the model f; right, Q-values 3, 6, 8 attached directly to the actions of a state, so the greedy action can be read off without a model]
Learning an optimal Q-Function

To find Q*, a value iteration algorithm can be applied:

Q_{k+1}(s, a) := Σ_{s′∈S} p(s′, s, a) (c(s, a) + J_k(s′))

where J_k(s) = min_{a′∈A(s)} Q_k(s, a′)

⋄ Furthermore, learning a Q-function without a model, from experienced transition tuples (s, a) → s′ only, is possible:

Q-learning (Q-value iteration + Robbins-Monro stochastic approximation):

Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (c(s, a) + min_{a′∈A(s′)} Q_k(s′, a′))
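The model-free update is a single line once Q is stored as a table. The following hypothetical helper applies one Robbins-Monro step to an observed transition (s, a) → s′, reusing the GOAL convention from the sketches above.

def q_update(Q, s, a, s_next, cost, alpha, actions):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (c(s,a) + min_a' Q(s',a'))."""
    J_next = 0.0 if s_next == GOAL else min(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (cost + J_next)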
Summary Q-learning

Q-learning is a variant of value iteration for the case when no model is available.

It is based on two major ingredients:
• it uses a representation of the costs-to-go of state/action pairs, Q(s, a)
• it uses a stochastic approximation scheme to incrementally compute expectation values on the basis of observed transitions (s, a) → s′

⋄ It converges under the same assumptions as value iteration, plus 'every state/action pair has to be visited infinitely often', plus the conditions for stochastic approximation.
Q-Learning algorithm

Repeat
    start in an arbitrary initial state s_0; t := 0
    Repeat
        choose action greedily: u_t := arg min_{a∈A} Q_k(s_t, a),
        or choose u_t according to an exploration scheme
        apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t)
        learn the Q-value:
        Q_{k+1}(s_t, u_t) := (1 − α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1}))
        where J_k(s_{t+1}) := min_{a∈A} Q_k(s_{t+1}, a)
    Until a terminal state is reached
Until the policy is optimal ('enough')
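Putting the loop structure above into runnable form: a tabular Q-learning sketch with ε-greedy exploration, reusing the q_update helper and the assumed grid world. The `step` argument stands in for the (possibly noisy) environment s_{t+1} = f(s_t, u_t, w_t), and the episode and step caps merely keep the sketch finite.

import random

def q_learning(states, actions, c, step, episodes=500, alpha=0.1, eps=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)               # arbitrary initial state s_0
        for _ in range(200):
            if s == GOAL:                       # terminal state reached
                break
            if random.random() < eps:           # exploration scheme
                a = random.choice(list(actions))
            else:                               # greedy: arg min_a Q(s, a)
                a = min(actions, key=lambda a2: Q[(s, a2)])
            s_next = step(s, a)                 # environment transition
            q_update(Q, s, a, s_next, c(s, a), alpha, actions)
            s = s_next
    return Q

Q_star = q_learning(STATES, ACTIONS, c, f)      # deterministic env: step = f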
Representation of the path costs in a function approximator

Idea: neural representation of the value function (or alternative function approximators) (Neuro-Dynamic Programming (Bertsekas, Tsitsiklis, 1996))

[Figure: a multilayer neural network with weights w_ij mapping an input state x to its value J(x)]

⇒ few parameters (here: weights) specify the value function for a large state space

⇒ learning by gradient descent: ∂E/∂w_ij = ∂(c(s, a) + J(s′) − J(s))² / ∂w_ij
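With a function approximator, the table lookup is replaced by a parameterized J and the update becomes a gradient step on the squared TD error. A minimal numpy sketch for a linear approximator J(s) = w·φ(s), where the feature map φ is a hypothetical placeholder; following common practice the bootstrapped target c(s, a) + J(s′) is held fixed, i.e. this is a semi-gradient step rather than the full gradient.

import numpy as np

def td_gradient_step(w, phi, s, a, s_next, cost, lr=0.01):
    """One step on E = (c(s,a) + J(s') - J(s))^2 with linear J(s) = w.phi(s)."""
    target = cost + float(w @ phi(s_next))   # bootstrapped target, held fixed
    delta = target - float(w @ phi(s))       # TD error
    w += lr * delta * phi(s)                 # semi-gradient descent on E
    return w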
Example: learning to intercept in robotic soccer

• as fast as possible (anticipation of the intercept position)
• random noise in ball and player movement → need for corrections
• a sequence of turn(θ) and dash(v) commands is required

⇒ hand-coding such a routine is a lot of work, with many parameters to tune!
Reinforcement learning of intercept

Goal: the ball is in the kick range of the player.

• state space: S_work = positions on the pitch
• S+: ball in kick range
• S−: e.g. collision with an opponent
• c(s) = 0, if s ∈ S+
         1, if s ∈ S−
         0.01, else
• actions: turn(10°), turn(20°), ..., turn(360°), ..., dash(10), dash(20), ...
• neural value function (6-20-1 architecture)
Learning curves

[Plot: percentage of successful intercepts (0-100%) over training, x-axis 0-50,000 ("kick.stat" u 3:7)]

[Plot: costs, i.e. time to intercept (0.02-0.2), over training, x-axis 0-50,000 ("kick.stat" u 3:11)]