Introduction to Deep Q-network
Presenter: Yunshu Du
CptS 580 Deep Learning
10/10/2016
Deep Q-network (DQN)
• An artificial agent for general Atari game playing
– Learns to master 49 different Atari games directly from game screens
– Beat the best-performing learner from the same domain in 43 games
– Surpassed human experts in 29 games
Deep Q-network (DQN)
• A demo of DQN playing Atari Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN is reinforcement learning + CNN magic!
• “Q”: Q-learning, a reinforcement learning (RL) method in which the agent interacts with the environment to maximize future rewards
• “Deep”, “network”: deep artificial neural networks that learn general representations in complex environments
Q-Learning
• Action-value (Q) function: Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π]
• The optimal Q function obeys the Bellman equation: Q*(s, a) = E_{s′}[r + γ max_{a′} Q*(s′, a′) | s, a]
• The Q-Learning algorithm: Q(s, a) ← Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a))
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Q-Learning
• Exploration vs. Exploitation
– Do I want to learn as much as possible, or do my best at things I already know?
– ε-greedy exploration to select actions
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
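The ε-greedy rule and the underlying Q-learning update can be sketched in plain Python for the tabular case (a minimal sketch with hypothetical helper names; DQN itself updates network weights rather than a table):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest Q value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
```

For example, starting from an all-zero table, `q_update(q, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9)` moves Q(0, 1) halfway toward the target 1.0, giving 0.5.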
Example: Q-Learning for Atari Breakout
Q-Learning
• But what if there are too many states/actions?
– Solution: use a deep convolutional network as a function approximator
Deep Convolutional Neural Network (CNN)
• Extracts features directly from raw pixels
• Atari game image pre-processing: 84x84x4
http://cs231n.github.io/convolutional-networks/
DQN Architecture
Input image: 84x84x4
• Conv layer 1: 32 filters, 8x8, stride 4 → output size (84-8)/4+1 = 20, i.e. 20x20x32; #W0 = (8*8*4)*32 = 8192
• Conv layer 2: 64 filters, 4x4, stride 2 → output size (20-4)/2+1 = 9, i.e. 9x9x64; #W1 = (4*4*32)*64 = 32768
• Conv layer 3: 64 filters, 3x3, stride 1 → output size 7x7x64; #W2 = (3*3*64)*64 = 36864
• Reshape 7*7*64 = 3136, then a fully connected layer of 512 rectifier units
• Output: one Q value for each action
Any missing component?
http://www.slideshare.net/onghaoyi/distributed-deep-qlearning
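The layer sizes above follow the standard "valid" convolution formula; a small Python sketch (hypothetical helper names) that reproduces the numbers on this slide:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a 'valid' convolution: (size - kernel) // stride + 1."""
    return (size - kernel) // stride + 1

def conv_weights(kernel, in_channels, out_channels):
    """Weight count of a conv layer (biases ignored):
    kernel * kernel * in_channels * out_channels."""
    return kernel * kernel * in_channels * out_channels
```

Here `conv_out(84, 8, 4)` → 20, `conv_out(20, 4, 2)` → 9, `conv_out(9, 3, 1)` → 7, and 7*7*64 = 3136, matching the reshape size above.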
Q-Learning
• Problem: reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used
– Correlation between samples
– Small updates to Q values may significantly change the policy
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.
Q-Learning
• Solutions in DQN
– Experience replay
• At each time step t, store the experience e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D_t = {e_1, …, e_t}
• Randomly draw samples of experience (s, a, r, s′) ~ U(D) and apply the Q update in minibatch fashion
– Separate target network
• Clone Q(s, a; θ) to a separate target network Q̂(s, a; θ⁻) every C time steps
• Treat y as the target; the weights θ⁻ are held fixed during updates
– Reward clipping
• Clip rewards to {-1, 1}
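The experience-replay and reward-clipping mechanics can be sketched in plain Python (a minimal sketch with a hypothetical ReplayBuffer class; rewards are clipped to the range [-1, 1] per the slide, and uniform random sampling stands in for U(D)):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: store (s, a, r, s_next) tuples,
    then sample uniformly at random to break sample correlation."""

    def __init__(self, capacity):
        # Old experiences fall off the front once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        r = max(-1.0, min(1.0, r))  # reward clipping to [-1, 1]
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

The bounded `deque` implements the "limited experience replay" criticized later: once full, the oldest experiences are silently discarded regardless of how informative they were.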
Deep Q-network (DQN)
• Minimize the squared error loss
L(θ) = E[(r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²]
where r + γ max_{a′} Q(s′, a′; θ⁻) is the target and Q(s, a; θ) is the prediction
• Stochastic gradient descent w.r.t. the weights
– Minibatch of size 32
• Update weights using RMSprop: divide the gradient by a running average of its recent magnitude
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
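A minimal Python sketch of the target computation and loss (hypothetical helper names; terminal-state handling and the gradient step are omitted, and `q_target` stands in for the frozen network Q̂(·; θ⁻)):

```python
def td_targets(batch, q_target, gamma=0.99):
    """y = r + gamma * max_a' Q_hat(s', a'; theta_minus) for each sample.
    q_target maps a state to a list of Q values, one per action."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def squared_error_loss(predictions, targets):
    """Mean squared error between the predicted Q(s, a; theta) values
    and the (fixed) targets y."""
    n = len(predictions)
    return sum((y - p) ** 2 for p, y in zip(predictions, targets)) / n
```

Because `q_target` is held fixed between updates of θ⁻, the targets do not chase the network being trained, which is the stabilization argument on the previous slide.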
DQN: Putting It Together
Input → CNN → Q value for each action
1. Store experience (s_t, a_t, r_t, s_{t+1})
2. Sample a minibatch
3. Calculate the target for each sample
4. Calculate the gradient and update the weights
Q-Learning vs. DQN
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
But … it’s not perfect!
• Reward clipping
– The agent can’t distinguish different scales of rewards (e.g., Pac-Man)
• Limited experience replay
– Might throw away important experiences
• High computational complexity
– Almost 10 days to train one game on a single GPU! Slower on physical robots
– 10+ GB to store experiences
Andrej Karpathy’s blog
Beyond DQN
• More stable learning
– Double DQN (van Hasselt, H. et al. (2015)): uses two Q-networks, one to select actions, the other to evaluate them
• Limited experience replay
– Prioritized Experience Replay (Schaul, T. et al. (2016)): weight experiences according to surprise
• High computational time complexity
– Parallel/distributed computing (Nair, A. et al. (2015))
– Dueling network (Wang, Z. et al. (2015)): splits the DQN into two streams
– Asynchronous RL (A3C) (Mnih, V. et al. (2016)): can be trained on CPU
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
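The difference between the standard DQN target and the Double DQN target can be sketched in plain Python (hypothetical helper names; Q values are given as plain lists indexed by action):

```python
def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates
    the next action via a single max."""
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the greedy action,
    the target network evaluates it."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]
```

With r = 0, γ = 1, online values [1.0, 0.0], and target values [0.0, 5.0], standard DQN takes the max of the target values (5.0), while Double DQN evaluates the online network's greedy action 0 under the target network (0.0), illustrating how decoupling selection from evaluation curbs overestimation.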
Beyond DQN
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
Beyond DQN
• Deep Policy Network for continuous control
– Simulated robots
– Physical robots
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
Beyond DQN
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., et al. (2016). Mastering the game of Go with deep neural networks and tree search.
So … DQN is not magic
• Q-learning + a CNN as function approximator
• Experience replay + a separate target network + reward clipping = stabilized learning
• To be continued …