reinforcement learning - cornell universitydkm/tutorials/reinforcement_learning.pdf ·...
TRANSCRIPT
![Page 2: reinforcement learning - Cornell Universitydkm/tutorials/reinforcement_learning.pdf · 2016-11-11 · What is reinforcement learning? “Reinforcement learning is a computation approach](https://reader035.vdocuments.site/reader035/viewer/2022070818/5f1599af7847f66e97023702/html5/thumbnails/2.jpg)
Task
Setup from Lenz et al. 2014
Grasp the green cup.
Output: Sequence of controller actions
Supervised Learning
Setup from Lenz et al. 2014
Grasp the green cup.
Expert Demonstrations
Problem?
Supervised Learning
Grasp the cup.
Problem? No exploration
Training data
Test data
Exploring the environment
What is reinforcement learning?
“Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment”
- R. Sutton and A. Barto
Interaction with the environment
[Figure: the agent sends an action to the environment; the environment returns a scalar reward plus a new environment]
Setup from Lenz et al. 2014
Interaction with the environment
Setup from Lenz et al. 2014
Episodic vs. non-episodic
[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
Rollout
Setup from Lenz et al. 2014
[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$
Setup
$s_t \xrightarrow{a_t,\ r_t} s_{t+1}$
e.g., 1$
Policy
$\pi(s, a) = 0.9$
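To make the notation concrete, a stochastic policy over finite states and actions can be stored as a table of probabilities, where $\pi(s, a)$ is the probability of choosing action $a$ in state $s$. The sketch below is illustrative only: the state names, action names, and every probability other than 0.9 are hypothetical and do not come from the slides.

```python
import random

# Hypothetical stochastic policy: pi[s][a] is the probability of taking
# action a in state s. State and action names are made up for illustration.
pi = {
    "cup_visible": {"reach": 0.9, "wait": 0.1},
    "cup_grasped": {"lift": 1.0},
}


def sample_action(pi, state, rng):
    """Draw an action a with probability pi[state][a]."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Each row of the policy table must be a probability distribution.
for state, dist in pi.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

rng = random.Random(0)
print(pi["cup_visible"]["reach"])
print(sample_action(pi, "cup_visible", rng))
```

A deterministic policy is the special case in which one action per state has probability 1, as in the `cup_grasped` row.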
Interaction with the environment
[Figure: the agent sends an action to the environment; the environment returns a reward plus a new environment]
Setup from Lenz et al. 2014
Objective?
Objective
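Setup from Lenz et al. 2014
[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$
maximize expected reward
$$E\left[\sum_{t=1}^{n} r_t\right]$$
Problem?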
Discounted Reward
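[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
maximize expected reward
$$E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$$
Problem? unbounded
discount future reward
$$E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right], \qquad \gamma \in [0, 1)$$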
Discounted Reward
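[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
maximize discounted expected reward
$$E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
if $r \le M$ and $\gamma \in [0, 1)$
$$E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1 - \gamma}$$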
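The bound $M / (1 - \gamma)$ can be checked numerically. The sketch below assumes the worst case in which every reward equals the bound $M$; the values `M = 1.0` and `gamma = 0.9` are arbitrary choices for illustration, not values from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


M = 1.0        # assumed upper bound on every reward
gamma = 0.9    # discount factor in [0, 1)
bound = M / (1.0 - gamma)

# Worst case: every reward equals M. The partial sums grow with the
# horizon n but never exceed M / (1 - gamma).
for n in (10, 100, 1000):
    g = discounted_return([M] * n, gamma)
    assert g <= bound + 1e-9
    print(n, round(g, 4))
print("bound", round(bound, 4))
```

With `gamma = 1.0` the same partial sums would grow linearly in `n`, which is the unboundedness that discounting removes.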
Need for discounting
• To keep the problem well formed
• Evidence that humans discount future reward
Markov Decision Process
MDP is a tuple $(S, A, P, R, \gamma)$ where
• $S$ is a finite or infinite set of states
• $A$ is a finite or infinite set of actions
• For the transition $s \xrightarrow{a} s'$
• $P^a_{s,s'} \in P$ is the transition probability
• $R^a_{s,s'} \in R$ is the reward for the transition
• $\gamma \in [0, 1]$ is the discount factor
Markov assumption: $P^a_{s,s'}$ and $R^a_{s,s'}$ depend only on the current state $s$ and action $a$.
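One possible way to hold the tuple in code, assuming a finite MDP with integer-indexed states and actions. The class name `MDP`, its field names, and the toy numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Finite MDP (S, A, P, R, gamma) with states and actions indexed from 0."""
    n_states: int
    n_actions: int
    P: list       # P[s][a][s2] = probability of reaching s2 from s under a
    R: list       # R[s][a][s2] = reward for the transition s --a--> s2
    gamma: float  # discount factor

    def validate(self):
        """Check that every P[s][a] is a distribution over next states."""
        for s in range(self.n_states):
            for a in range(self.n_actions):
                row = self.P[s][a]
                assert len(row) == self.n_states
                assert all(p >= 0.0 for p in row)
                assert abs(sum(row) - 1.0) < 1e-9
        assert 0.0 <= self.gamma <= 1.0
        return True


# Hypothetical two-state, two-action MDP; the numbers are made up.
toy = MDP(
    n_states=2,
    n_actions=2,
    P=[[[0.9, 0.1], [0.2, 0.8]],
       [[0.5, 0.5], [0.0, 1.0]]],
    R=[[[0.0, 0.0], [0.0, 1.0]],
       [[0.0, 0.0], [0.0, 2.0]]],
    gamma=0.9,
)
print(toy.validate())
```

Because `P[s][a]` is indexed only by the current state and action, this representation encodes the Markov assumption directly: no earlier history can influence the next state or reward.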
MDP Example
Example from Sutton and Barto 1998
Summary
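[Figure: a sequence of actions $a_1, a_2, \ldots, a_n$ with rewards $r_1, r_2, \ldots, r_n$]
Maximize discounted expected reward
$$E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$$
MDP is a tuple $(S, A, P, R, \gamma)$
Agent controls the policy $\pi(s, a)$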
What we learned
Reinforcement Learning
Agent-Reward-Environment
No supervision
Exploration
MDP
Policy
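Value functions
• Expected reward from following a policy
State value function
$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
State action value function
$$Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, a_1 = a, \pi\right]$$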
State Value function
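$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$$
[Figure: trajectory tree $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots$. The step from $s_1$ to $s_2$ has probability $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; the step from $s_2$ to $s_3$ has probability $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}$ and reward $R^{a_2}_{s_2,s_3}$.]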
State Value function
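$$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t)$$
where $t = \langle s_1, a_1, s_2, a_2, \ldots \rangle = \langle s_1, a_1, s_2 \rangle : t'$
[Figure: trajectory tree $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots$. The step from $s_1$ to $s_2$ has probability $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; the step from $s_2$ to $s_3$ has probability $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}$ and reward $R^{a_2}_{s_2,s_3}$.]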
State Value function
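$$\begin{aligned}
V^\pi(s_1) &= E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \\
&= \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t) \\
&= \sum_{a_1, s_2} \sum_{t'} P(s_1, a_1, s_2)\, P(t' \mid s_1, a_1, s_2) \left\{ R^{a_1}_{s_1,s_2} + \gamma (r_2 + \cdots) \right\} \\
&= \sum_{a_1, s_2} P(s_1, a_1, s_2) \Big\{ R^{a_1}_{s_1,s_2} + \gamma \underbrace{\sum_{t'} P(t' \mid s_1, a_1, s_2)\,(r_2 + \cdots)}_{V^\pi(s_2)} \Big\} \\
&= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2} \left\{ R^{a_1}_{s_1,s_2} + \gamma V^\pi(s_2) \right\}
\end{aligned}$$
where $t = \langle s_1, a_1, s_2, a_2, \ldots \rangle = \langle s_1, a_1, s_2 \rangle : t'$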
Bellman Self-Consistency Eqn
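$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$$
similarly
$$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$$
$$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \Big( R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a')\, Q^\pi(s', a') \Big)$$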
Bellman Self-Consistency Eqn
$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$$
Given N states, we have N equations in N variables
Solve the above equation
Does it have a unique solution?
Yes, it does. Exercise: Prove it.
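For a finite MDP the N equations are linear in the N unknowns $V^\pi(s)$, so they can be solved directly as $(I - \gamma P_\pi) V = R_\pi$. A minimal sketch, assuming a hypothetical two-state, two-action MDP and a uniform random policy; `solve_linear` is a small helper written here for self-containment, not a library routine.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def evaluate_policy(P, R, pi, gamma):
    """Solve the Bellman self-consistency equations for V^pi."""
    n = len(P)
    A = [[0.0] * n for _ in range(n)]   # becomes I - gamma * P_pi
    b = [0.0] * n                       # becomes the expected one-step reward
    for s in range(n):
        for a, prob_a in enumerate(pi[s]):
            for s2 in range(n):
                A[s][s2] -= gamma * prob_a * P[s][a][s2]
                b[s] += prob_a * P[s][a][s2] * R[s][a][s2]
        A[s][s] += 1.0
    return solve_linear(A, b)


# Hypothetical toy MDP: P[s][a][s2] and R[s][a][s2]; numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 2.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform random policy
gamma = 0.9

V = evaluate_policy(P, R, pi, gamma)
print([round(v, 4) for v in V])
```

The direct solve costs on the order of N cubed operations, which is only practical for small state spaces.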
Optimal Policy
$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$$
Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if:
$$V^{\pi_1}(s) \ge V^{\pi_2}(s)$$
How to define a globally optimal policy?
Optimal Policy
Policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if:
$$V^{\pi_1}(s) \ge V^{\pi_2}(s)$$
How to define a globally optimal policy?
$\pi^*$ is an optimal policy if:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \pi$$
Does it always exist?
Yes, it always does.
Existence of Optimal Policy
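Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^{\pi}(s)$
Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$
To show $\pi^*$ is optimal, or equivalently:
$$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$$
$$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \right\}$$
$$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \right\}$$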
Existence of Optimal Policy
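Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^{\pi}(s)$
Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$
Subtracting the Bellman equation for $V^{\pi_s}(s)$ from the one for $V^{\pi^*}(s)$, the reward terms cancel:
$$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ V^{\pi^*}(s') - V^{\pi_s}(s') \right\}$$
Since $\pi_{s'}$ is the leader at $s'$, the braced term satisfies $V^{\pi^*}(s') - V^{\pi_s}(s') \ge V^{\pi^*}(s') - V^{\pi_{s'}}(s') = \delta(s')$, so
$$\delta(s) \ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'}\, \delta(s') = \gamma\, \mathrm{conv}(\delta(s'))$$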
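Existence of Optimal Policy
Leader policy for every state $s$ is: $\pi_s = \arg\max_{\pi} V^{\pi}(s)$
Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$
$$\delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \left\{ V^{\pi^*}(s') - V^{\pi_s}(s') \right\} \ge \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'}\, \delta(s') = \gamma\, \mathrm{conv}(\delta(s'))$$
A convex combination is at least the minimum of its terms, so
$$\delta(s) \ge \gamma \min_{s'} \delta(s')$$
$$\min_{s} \delta(s) \ge \gamma \min_{s'} \delta(s')$$
$$\gamma \in [0, 1) \Rightarrow \min_{s} \delta(s) \ge 0$$
Hence proved.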
Bellman’s Optimality Condition
Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$
$$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \le \max_{a} Q^{\pi^*}(s, a)$$
Claim: $V^{\pi^*}(s) = \max_{a} Q^{\pi^*}(s, a)$
Let $V^{\pi^*}(s) < \max_{a} Q^{\pi^*}(s, a)$
Define $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$
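Bellman’s Optimality Condition
$$\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$$
$$V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$$
$$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$$
$$\begin{aligned}
\delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) &= Q^{\pi'}(s, \pi'(s)) - \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \\
&\ge Q^{\pi'}(s, \pi'(s)) - Q^{\pi^*}(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'}\, \delta(s')
\end{aligned}$$
$$\delta(s) \ge 0$$
$\exists s'$ such that $\delta(s') > 0 \Rightarrow \pi^*$ is not optimal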
Bellman’s Optimality Condition
$$V^*(s) = \max_{a} Q^*(s, a)$$
$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$$
similarly
$$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a') \right\}$$
Optimal policy from Q value
Given $Q^*(s, a)$, an optimal policy is given by:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
Corollary: Every MDP has a deterministic optimal policy
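Reading a deterministic policy off a tabular $Q^*$ is a per-state argmax. A minimal sketch; the table `Q_star` is a made-up example rather than the solution of any particular MDP, and ties are broken toward the lowest action index (an arbitrary choice).

```python
def greedy_policy(Q):
    """Return pi[s] = argmax_a Q[s][a], lowest index winning ties."""
    return [max(range(len(q_s)), key=lambda a: q_s[a]) for q_s in Q]


def state_values(Q):
    """Return V[s] = max_a Q[s][a]."""
    return [max(q_s) for q_s in Q]


# Hypothetical Q* table: Q_star[s][a] for 3 states and 2 actions.
Q_star = [[1.0, 3.5], [2.0, 0.5], [4.0, 4.0]]

pi_star = greedy_policy(Q_star)
print(pi_star)               # prints [1, 0, 0]
print(state_values(Q_star))  # prints [3.5, 2.0, 4.0]
```

In the third state both actions are optimal, which shows that the deterministic optimal policy need not be unique.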
Summary
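Bellman’s self-consistency equation
$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$$
Bellman’s optimality condition
$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$$
An optimal policy $\pi^*$ exists such that:
$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \pi$$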
What we learned
Reinforcement Learning
Agent-Reward-Environment
No supervision
Exploration
MDP
Policy
Consistency Equation
Optimal Policy
Optimality Condition
Solving MDP
To solve an MDP is to find an optimal policy
Bellman’s Optimality Condition
$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$$
Iteratively solve the above equation
Bellman Backup Operator
$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\}$$
$$T : V \to V$$
$$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$$
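The operator $T$ maps one value table to another, which for a finite MDP is a short function. A minimal sketch, assuming tabular `P[s][a][s2]` and `R[s][a][s2]` stored as nested lists; the function name `bellman_backup` and the toy numbers are hypothetical.

```python
def bellman_backup(V, P, R, gamma):
    """Apply (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    TV = []
    for s in range(len(P)):
        action_values = [
            sum(p * (R[s][a][s2] + gamma * V[s2])
                for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        ]
        TV.append(max(action_values))
    return TV


# Hypothetical toy MDP with 2 states and 2 actions; numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 2.0]]]

V0 = [0.0, 0.0]
V1 = bellman_backup(V0, P, R, 0.9)
print(V1)
```

Starting from the all-zero table, one backup returns the best expected immediate reward in each state, because the discounted future term vanishes.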
Dynamic Programming Solution
Initialize $V^0$ randomly
do
    $V^{t+1} = TV^t$
until $\|V^{t+1} - V^t\|_\infty < \epsilon$
return $V^{t+1}$
where
$$V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^t(s') \right\}$$
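The loop above can be sketched in a few lines for a tabular MDP. This is an illustrative implementation under stated assumptions, not the author's code: the toy MDP, the function names, the default `eps`, and the seeded random initialization are all hypothetical choices.

```python
import random


def bellman_backup(V, P, R, gamma):
    """Apply (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    return [
        max(
            sum(p * (R[s][a][s2] + gamma * V[s2])
                for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]


def value_iteration(P, R, gamma, eps=1e-8, seed=0):
    """Iterate V <- TV until successive tables differ by less than eps."""
    rng = random.Random(seed)
    V = [rng.random() for _ in range(len(P))]   # arbitrary starting table V^0
    while True:
        V_next = bellman_backup(V, P, R, gamma)
        gap = max(abs(x - y) for x, y in zip(V_next, V))   # sup-norm distance
        V = V_next
        if gap < eps:
            return V


# Hypothetical toy MDP with 2 states and 2 actions; numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 2.0]]]

V_star = value_iteration(P, R, gamma=0.9)
print([round(v, 4) for v in V_star])
```

The stopping test uses `<` rather than `>`: the loop should end once successive iterates are close, not while they are still far apart.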
Convergence
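Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$
where
$$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$$
$$\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}; \quad x \in \mathbb{R}^k$$
Proof:
$$|(TV_1)(s) - (TV_2)(s)| = \left| \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V_1(s') \right\} - \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V_2(s') \right\} \right|$$
using
$$\left| \max_{x} f(x) - \max_{x} g(x) \right| \le \max_{x} |f(x) - g(x)|$$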
Convergence
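Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$
where $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}; \; x \in \mathbb{R}^k$
Proof:
$$\begin{aligned}
|(TV_1)(s) - (TV_2)(s)| &\le \max_{a} \gamma \left| \sum_{s'} P^a_{s,s'} \left( V_1(s') - V_2(s') \right) \right| \\
&\le \max_{a} \gamma \sum_{s'} P^a_{s,s'} \left| V_1(s') - V_2(s') \right| \\
&\le \gamma \max_{a} \max_{s'} \left| V_1(s') - V_2(s') \right| \\
&= \gamma \|V_1 - V_2\|_\infty
\end{aligned}$$
$$\Rightarrow \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$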
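The contraction inequality can be spot-checked numerically on random value tables. This is a sanity check on one hypothetical toy MDP, not a substitute for the proof; the MDP, the sampling range, and the number of trials are arbitrary assumptions.

```python
import random


def bellman_backup(V, P, R, gamma):
    """Apply (TV)(s) = max_a sum_s2 P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    return [
        max(
            sum(p * (R[s][a][s2] + gamma * V[s2])
                for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]


def sup_norm_diff(U, V):
    """Return the sup-norm distance max_s |U[s] - V[s]|."""
    return max(abs(u - v) for u, v in zip(U, V))


# Hypothetical toy MDP with 2 states and 2 actions; numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 2.0]]]
gamma = 0.9

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = [rng.uniform(-10.0, 10.0) for _ in range(2)]
    V2 = [rng.uniform(-10.0, 10.0) for _ in range(2)]
    lhs = sup_norm_diff(bellman_backup(V1, P, R, gamma),
                        bellman_backup(V2, P, R, gamma))
    rhs = gamma * sup_norm_diff(V1, V2)
    assert lhs <= rhs + 1e-9          # ||TV1 - TV2|| <= gamma * ||V1 - V2||
    if rhs > 0.0:
        worst_ratio = max(worst_ratio, lhs / rhs)
print(worst_ratio <= 1.0 + 1e-9)
```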
Optimal is a fixed point
$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\} = (TV^*)(s)$$
$V^*$ is a fixed point of $T$
Optimal is the fixed point
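$$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^*(s') \right\} = (TV^*)(s)$$
$V^*$ is a fixed point of $T$
Theorem: $V^*$ is the only fixed point of $T$
Proof: Let $TV_1 = V_1$ and $TV_2 = V_2$. Then
$$\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$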
Dynamic Programming Solution
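Initialize $V^0$ randomly
do
    $V^{t+1} = TV^t$
until $\|V^{t+1} - V^t\|_\infty < \epsilon$
return $V^{t+1}$
Theorem: algorithm converges for all $V^0$
Proof:
$$\|V^{t+1} - V^*\|_\infty = \|TV^t - TV^*\|_\infty \le \gamma \|V^t - V^*\|_\infty$$
$$\|V^t - V^*\|_\infty \le \gamma^t \|V^0 - V^*\|_\infty$$
$$\lim_{t \to \infty} \|V^t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V^0 - V^*\|_\infty = 0$$
Problem?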
Summary
Iteratively solving optimality condition
$$V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^t(s') \right\}$$
Bellman Backup Operator
$$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V(s') \right\}$$
Convergence of the iterative solution
What we learned
Reinforcement Learning
Agent-Reward-Environment
No supervision
Exploration
MDP
Policy
Consistency Equation
Optimal Policy
Optimality Condition
Bellman Backup Operator
Iterative Solution
In next tutorial
• Value and Policy Iteration
• Monte Carlo Solution
• SARSA and Q-Learning
• Policy Gradient Methods
• Learning to search OR Atari game paper?
https://dipendramisra.wordpress.com/