Reinforcement Learning: An Overview
Yi Cui, Fei Feng, Yibo Zeng
Dept. of Mathematics, UCLA
November 13, 2017
1 / 32
Model : Sequential Markov Decision Process
• discrete time steps, t = 1, 2, 3, . . .
• St ∈ S, state space
• At ∈ A(St) ⊂ A, action space
• Rt(St, At) : S × A → R, immediate reward of taking At at St
• p(St+1, Rt | St, At), transition probability
2 / 32
An Optimization Problem
Policy: π : S → P(A)

Agent’s Goal: choose a policy to maximize a function of the reward sequence.

• expected total discounted reward

  E[G1] = E[R1 + γR2 + γ^2 R3 + · · · ] = E[ ∑_{k=1}^∞ γ^{k−1} Rk ]

  where 0 < γ < 1 is a discount rate.
• average reward
3 / 32
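As a quick numerical check of the discounted return above, a minimal sketch (the reward sequence and γ below are hypothetical):

```python
# Computing the discounted return G1 = R1 + γR2 + γ^2 R3 + ... for a finite
# reward sequence (the rewards and gamma are made-up example values).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # Horner-style backward accumulation
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```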
An Optimization Problem
State-value function of π
V^π(s) = E[ Gt | St = s; π ]
Fix π; V^π satisfies the Bellman equation (a self-consistency condition):

V^π(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r|s, a) [ r + γ V^π(s′) ],  ∀s ∈ S
Action-value function of π
Q^π(s, a) = E[ Gt | St = s, At = a; π ]
          = E[ Rt + γ Gt+1 | St = s, At = a; π ]
          = E[ Rt + γ V^π(St+1) | St = s, At = a; π ]
4 / 32
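When the model is known, the Bellman equation above can be solved by applying it repeatedly as an update (iterative policy evaluation). A minimal sketch on a hypothetical two-state MDP; the dynamics `p` and the policy `pi` are made up so the fixed point is easy to verify by hand:

```python
# Iterative policy evaluation: sweep the Bellman equation until V is
# self-consistent. p[s][a] lists (probability, next_state, reward) triples.
gamma = 0.9
p = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
pi = {0: {"go": 1.0}, 1: {"stay": 1.0}}   # pi(a|s), deterministic here

V = {s: 0.0 for s in p}
for _ in range(500):   # synchronous sweeps; a contraction since gamma < 1
    V = {s: sum(prob * sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[s][a])
                for a, prob in pi[s].items())
         for s in p}
```

For this toy MDP the fixed point is V(1) = 2/(1 − γ) = 20 and V(0) = 1 + γ·20 = 19, which the sweeps converge to.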
Optimal Policy and Optimal Value
Build a partial ordering over policies
π ≥ π′ ⇐⇒ V^π(s) ≥ V^{π′}(s), ∀s ∈ S

Find an optimal policy π∗ such that

V^∗(s) := V^{π∗}(s) = max_π V^π(s), ∀s ∈ S.
The Bellman optimality equations for V^∗, Q^∗ are

V^∗(s) = max_a E[ Rt + γ V^∗(St+1) | St = s, At = a ]   (1)

Q^∗(s, a) = E[ Rt + γ max_{a′} Q^∗(St+1, a′) | St = s, At = a ].   (2)

Note that

π^∗(s) = arg max_a Q^∗(s, a),   V^∗(s) = max_a Q^∗(s, a),   ∀s ∈ S.
5 / 32
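Using the Bellman optimality equation (1) as an update rule gives value iteration, after which the greedy policy is read off from the note above. A sketch on a hypothetical two-state MDP (all dynamics made up for illustration):

```python
# Value iteration: V <- max_a E[r + gamma V(s')], then pi*(s) = argmax_a.
gamma = 0.9
p = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
V = {s: 0.0 for s in p}
for _ in range(500):
    V = {s: max(sum(pr * (r + gamma * V[s2]) for pr, s2, r in outs)
                for outs in p[s].values())
         for s in p}
# greedy policy: pi*(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V*(s')]
pi_star = {s: max(p[s], key=lambda a, s=s: sum(pr * (r + gamma * V[s2])
                                               for pr, s2, r in p[s][a]))
           for s in p}
```

Here V*(1) = 20, V*(0) = 19, and the greedy policy is "go" in state 0 and "stay" in state 1.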
General Approach for π∗
• If the model p(s′, r|s, a) is available → dynamic programming.
• If no model → reinforcement learning (RL) algorithms:
  1. model-based methods: learn a model, then derive the optimal policy.
  2. model-free methods: learn the optimal policy without learning a model.

For model-free methods,

• Value-based: evaluate Q^∗, then derive a greedy policy
  1. Tabular implementation: Sarsa [1], Q-learning [2]
  2. Function approximation: deep Q-learning
• Policy-based: search the policy space for better value: actor-critic (DDPG)
6 / 32
On-policy vs. Off-policy

• Learning Policy: a policy that maps experience into a current choice
  (experience contains: states visited, actions chosen, rewards received, etc.)
• Update Rule: how the algorithm uses experience to change its estimate of the
  optimal value function.
7 / 32
Sarsa: On-policy TD Control
8 / 32
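A minimal tabular Sarsa sketch; the 4-state chain environment (reward 1 on reaching the rightmost state, which ends the episode) and all hyperparameters are hypothetical:

```python
import random

# Tabular Sarsa with an epsilon-greedy learning policy on a made-up
# 4-state chain: actions move left/right, reward 1 at the rightmost state.
N, ACTIONS = 4, [-1, +1]
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def eps_greedy(s):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1

random.seed(0)
for _ in range(500):                     # episodes
    s, done = 0, False
    a = eps_greedy(s)
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(s2)
        # on-policy: bootstrap from the action a2 actually chosen next
        target = r + (0.0 if done else gamma * Q[(s2, a2)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
```

After training, the greedy action in every non-terminal state should be +1 (move right).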
Q-learning: Off-policy TD Control
9 / 32
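A matching tabular Q-learning sketch on a hypothetical 4-state chain (reward 1 at the rightmost state): unlike Sarsa, the update bootstraps from max over next actions, independent of the action the epsilon-greedy learning policy actually takes next, which is what makes it off-policy.

```python
import random

# Tabular Q-learning: the target uses max_a' Q(s', a') (off-policy update).
N, ACTIONS = 4, [-1, +1]
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

random.seed(0)
for _ in range(500):                     # episodes
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda b: Q[(s, b)])
        s2 = min(max(s + a, 0), N - 1)
        r, done = (1.0, True) if s2 == N - 1 else (0.0, False)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```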
The Drawbacks of Tabular Methods
• Huge memory demand; not scalable.
• Impossible to enumerate all states in more complicated scenarios.

Solution: use functions to approximate Q.
10 / 32
Deep Q-learning: Neural Network + Q-learning
Idea: represent the Q-function with a neural network that takes the state and
action as input and outputs the corresponding Q-value.
Update the Q-table each episode → update the parameters of the deep
Q-network (DQN).
11 / 32
Deep Q-learning
• Input: four consecutive 84 × 84 images.
• Two convolutional layers.
• Two dense layers.
• Output: a vector containing every action’s Q-value.
12 / 32
Deep Q-learning Algorithm
13 / 32
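A minimal sketch of the two training mechanics DQN adds on top of Q-learning, experience replay and a periodically synced target network. To stay self-contained, a tiny linear Q-approximator over one-hot features stands in for the deep network, and the environment is a hypothetical 4-state chain (reward 1 at the rightmost state):

```python
import random
from collections import deque

# DQN mechanics in miniature: replay buffer + target network, with a linear
# Q-approximator standing in for the convolutional network.
N, ACTIONS = 4, [-1, +1]
gamma, alpha, eps = 0.9, 0.05, 0.3

def features(s, a):
    f = [0.0] * (N * 2)                 # one-hot over (state, action) pairs
    f[s * 2 + (0 if a == -1 else 1)] = 1.0
    return f

def q(theta, s, a):
    return sum(w * x for w, x in zip(theta, features(s, a)))

theta = [0.0] * (N * 2)                 # online parameters
theta_target = list(theta)              # target-network parameters
replay = deque(maxlen=1000)             # experience replay buffer
random.seed(0)
steps = 0
for _ in range(300):                    # episodes
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda b: q(theta, s, b))
        s2 = min(max(s + a, 0), N - 1)
        r, done = (1.0, True) if s2 == N - 1 else (0.0, False)
        replay.append((s, a, r, s2, done))
        # minibatch SGD step on (y - Q(s,a;theta))^2; targets use theta_target
        for sj, aj, rj, sj2, dj in random.sample(list(replay), min(8, len(replay))):
            yj = rj if dj else rj + gamma * max(q(theta_target, sj2, b) for b in ACTIONS)
            err = yj - q(theta, sj, aj)
            theta = [w + alpha * err * x for w, x in zip(theta, features(sj, aj))]
        steps += 1
        if steps % 50 == 0:             # periodic target-network sync
            theta_target = list(theta)
        s = s2
```

Sampling old transitions breaks the correlation between consecutive updates, and the frozen target parameters keep the regression targets from chasing the online estimate.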
However, DQN can only handle discrete and low-dimensional action spaces.

Recall:

• Learning Policy: arg max_a Q(φ(st), a; θ)
• Update Rule: yj := rj + γ max_{a′} Q(φ_{j+1}, a′; θ⁻)
14 / 32
Discretize the action space?
• not scalable
  the number of discrete actions increases exponentially,
  e.g. 3^7 = 2187 for 7 action dimensions with 3 bins each
• no fine control
  naive discretization removes crucial information:
  the structure of A, and possibly the optimal action
15 / 32
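The blow-up behind the 2187 figure above can be seen directly: every combination of per-dimension bins becomes a separate discrete action.

```python
from itertools import product

# Naive discretization: 7 action dimensions x 3 bins each -> 3^7 actions.
bins_per_dim, dims = 3, 7
discrete_actions = list(product(range(bins_per_dim), repeat=dims))
print(len(discrete_actions))  # 2187
```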
Policy-Based Methods
Deterministic policy: µθ : S → A, where θ ∈ R^n is a vector of parameters.

Agent’s goal: obtain an optimal policy which maximizes the expected discounted
return from the start state, denoted by the performance objective

J(µθ) := E[G1 | µθ]   (3)
16 / 32
J(µθ) := E[G1 | µθ]

More specifically,

J(µθ) = ∫_S ρ^µ(s) r(s, µθ(s)) ds = E_{s∼ρ^µ}[ r(s, µθ(s)) ],   (4)

where ρ^µ(s′) is the (improper) discounted state distribution:

ρ^µ(s′) := ∫_S ∑_{t=1}^∞ γ^{t−1} p1(s) p(s → s′, t, µθ) ds   (5)
17 / 32
Deterministic Policy Ascent
Theorem 1 (Deterministic Policy Gradient Theorem, Silver et al., 2014)

∇θ J(µθ) = ∫_S ρ^µ(s) ∇θ Q^µ(s, µθ(s)) ds
         = E_{s∼ρ^µ}[ ∇θ Q^µ(s, µθ(s)) ]
         = E_{s∼ρ^µ}[ ∇θ µθ(s) ∇_a Q^µ(s, a)|_{a=µθ(s)} ]   (6)

Question: how can the action-value function Q^µ(s, a) be estimated?
18 / 32
Actor-Critic
• Actor: adjust the deterministic policy µθ(s) by stochastic gradient ascent.
• Critic: inspired by the success of DQN, estimate the action-value function
  with a parameterized approximator Q^w:

  Q^w(s, a) ≈ Q^{µθ}(s, a)   (7)

DNN? Experience replay? Target network?
19 / 32
Off-Policy Deterministic Policy Gradient (Silver et al., 2014)

∇_{θ^µ} J ≈ E_{µ′}[ ∇_{θ^µ} Q(s, a | θ^Q) |_{s=st, a=µ(st|θ^µ)} ]   (8)
          = E_{µ′}[ ∇_a Q(s, a | θ^Q) |_{s=st, a=µ(st)} ∇_{θ^µ} µ(s | θ^µ) |_{s=st} ]   (9)
20 / 32
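To make update (9) concrete, a toy sketch of one deterministic-policy-gradient ascent loop. Everything here is hypothetical: the critic Q(s, a) = −(a − s)^2 is given in closed form instead of being learned, and the policy is linear, µθ(s) = θ·s, so θ = 1 recovers the optimal action a = s in every state.

```python
import random

# Toy deterministic policy gradient: theta += lr * grad_theta(mu) * grad_a(Q).
def dQ_da(s, a):
    # gradient of Q(s, a) = -(a - s)^2 with respect to the action
    return -2.0 * (a - s)

theta, lr = 0.0, 0.1
random.seed(0)
for _ in range(200):
    s = random.uniform(-1.0, 1.0)   # states sampled from a behavior distribution
    a = theta * s                   # a = mu_theta(s)
    grad_theta_mu = s               # d mu_theta(s) / d theta
    theta += lr * grad_theta_mu * dQ_da(s, a)
print(round(theta, 3))  # close to 1.0
```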
21 / 32
22 / 32
Asynchronous One-Step Q-learning
23 / 32
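A simplified sketch of the idea behind asynchronous one-step Q-learning: several workers act in their own copy of the environment and apply one-step Q-learning updates to shared parameters. The 4-state chain environment is hypothetical, a shared tabular Q replaces the network, and a lock stands in for the asynchronous accumulated-gradient updates of the actual algorithm:

```python
import random
import threading

# Asynchronous one-step Q-learning in miniature: 4 workers, one shared Q.
N, ACTIONS = 4, [-1, +1]
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
lock = threading.Lock()

def worker(seed, episodes=200):
    rng = random.Random(seed)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            with lock:
                a = rng.choice(ACTIONS) if rng.random() < eps else \
                    max(ACTIONS, key=lambda b: Q[(s, b)])
            s2 = min(max(s + a, 0), N - 1)
            r, done = (1.0, True) if s2 == N - 1 else (0.0, False)
            with lock:   # one-step Q-learning update on the shared table
                target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker's exploration decorrelates the shared updates, which is the role experience replay plays in DQN.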
Asynchronous Advantage Actor-Critic (A3C)
24 / 32
MengDi Wang: Primal-Dual π
Infinite-horizon Average-reward MDP
• Maximize the value function
• Bellman equation
25 / 32
Linear Programming
• Primal
• Dual
26 / 32
• Minimax formulation
27 / 32
Dual-ascent-primal-descent
28 / 32
Simplify for implementation
29 / 32
Algorithm
30 / 32
Convergence
31 / 32
References:

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, Cambridge, 1998.

[2] Christopher J. C. H. Watkins. “Learning from Delayed Rewards.” PhD thesis, King’s College, Cambridge, 1989.
32 / 32