Introduction to Deep Q-network
Presenter: Yunshu Du
CptS 580 Deep Learning
10/10/2016
Deep Q-network (DQN)
• An artificial agent for general Atari game playing
– Learns to master 49 different Atari games directly from game screens
– Beat the best-performing learner from the same domain in 43 games
– Surpassed human experts in 29 games
Deep Q-network (DQN)
• A demo of DQN playing Atari Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN is reinforcement learning + CNN magic!
• “Q”: Q-learning, a reinforcement learning (RL) method in which the agent interacts with the environment to maximize future rewards
• “Deep”, “network”: deep artificial neural networks that learn general representations in complex environments
Q-Learning
• Action-value (Q) function: Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π]
• The optimal Q function obeys the Bellman equation: Q*(s, a) = E_{s′}[r + γ max_{a′} Q*(s′, a′) | s, a]
• The Q-Learning algorithm: Q(s, a) ← Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a))
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Q-Learning
• Exploration vs. Exploitation
– Do I want to learn as much as possible, or do my best at things I already know?
– ε-greedy exploration to select actions
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
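The ε-greedy rule and the underlying Q-learning update can be sketched in plain Python for the tabular case (a minimal sketch with hypothetical helper names; DQN itself updates network weights rather than a table):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest Q value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
```

For example, starting from an all-zero table, `q_update(q, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9)` moves Q(0, 1) halfway toward the target 1.0, giving 0.5.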
Example: Q-Learning for Atari Breakout
Q-Learning
• But what if there are too many states/actions?
– Solution: use a deep convolutional network as a function approximator
Deep Convolutional Neural Network (CNN)
• Extracts features directly from raw pixels
• Atari game image pre-processing: 84x84x4
http://cs231n.github.io/convolutional-networks/
DQN Architecture
Input image: 84x84x4
• Conv layer 1: 32 filters, 8x8, stride 4 → output size (84-8)/4+1 = 20, i.e. 20x20x32; #W0 = (8*8*4)*32 = 8192
• Conv layer 2: 64 filters, 4x4, stride 2 → output size (20-4)/2+1 = 9, i.e. 9x9x64; #W1 = (4*4*32)*64 = 32768
• Conv layer 3: 64 filters, 3x3, stride 1 → output size 7x7x64; #W2 = (3*3*64)*64 = 36864
• Reshape 7*7*64 = 3136, then a fully connected layer of 512 rectifier units
• Output: one Q value for each action
Any missing component?
http://www.slideshare.net/onghaoyi/distributed-deep-qlearning
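The layer sizes above follow the standard "valid" convolution formula; a small Python sketch (hypothetical helper names) that reproduces the numbers on this slide:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a 'valid' convolution: (size - kernel) // stride + 1."""
    return (size - kernel) // stride + 1

def conv_weights(kernel, in_channels, out_channels):
    """Weight count of a conv layer (biases ignored):
    kernel * kernel * in_channels * out_channels."""
    return kernel * kernel * in_channels * out_channels
```

Here `conv_out(84, 8, 4)` → 20, `conv_out(20, 4, 2)` → 9, `conv_out(9, 3, 1)` → 7, and 7*7*64 = 3136, matching the reshape size above.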
Q-Learning
• Problem: reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used
– Correlation between samples
– Small updates to Q values may significantly change the policy
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.
Q-Learning
• Solutions in DQN
– Experience replay
• At each time step t, store the experience e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D_t = {e_1, …, e_t}
• Randomly draw samples of experience (s, a, r, s′) ~ U(D) and apply the Q update in minibatch fashion
– Separate target network
• Clone Q(s, a; θ) to a separate target network Q̂(s, a; θ⁻) every C time steps
• Treat y as the target; the weights θ⁻ are held fixed during updates
– Reward clipping
• Clip rewards to {-1, 1}
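The experience-replay and reward-clipping mechanics can be sketched in plain Python (a minimal sketch with a hypothetical ReplayBuffer class; rewards are clipped to the range [-1, 1] per the slide, and uniform random sampling stands in for U(D)):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: store (s, a, r, s_next) tuples,
    then sample uniformly at random to break sample correlation."""

    def __init__(self, capacity):
        # Old experiences fall off the front once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        r = max(-1.0, min(1.0, r))  # reward clipping to [-1, 1]
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

The bounded `deque` implements the "limited experience replay" criticized later: once full, the oldest experiences are silently discarded regardless of how informative they were.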
Deep Q-network (DQN)
• Minimize the squared error loss
L(θ) = E[(r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²]
where r + γ max_{a′} Q(s′, a′; θ⁻) is the target and Q(s, a; θ) is the prediction
• Stochastic gradient descent w.r.t. the weights
– Minibatch of size 32
• Update weights using RMSprop: divide the gradient by a running average of its recent magnitude
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
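A minimal Python sketch of the target computation and loss (hypothetical helper names; terminal-state handling and the gradient step are omitted, and `q_target` stands in for the frozen network Q̂(·; θ⁻)):

```python
def td_targets(batch, q_target, gamma=0.99):
    """y = r + gamma * max_a' Q_hat(s', a'; theta_minus) for each sample.
    q_target maps a state to a list of Q values, one per action."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def squared_error_loss(predictions, targets):
    """Mean squared error between the predicted Q(s, a; theta) values
    and the (fixed) targets y."""
    n = len(predictions)
    return sum((y - p) ** 2 for p, y in zip(predictions, targets)) / n
```

Because `q_target` is held fixed between updates of θ⁻, the targets do not chase the network being trained, which is the stabilization argument on the previous slide.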
DQN: Putting It Together
Input → CNN → Q value for each action
1. Store experience (s_t, a_t, r_t, s_{t+1})
2. Sample a minibatch
3. Calculate the target for each sample
4. Calculate the gradient and update the weights
Q-Learning vs. DQN
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
But … it’s not perfect!
• Reward clipping
– The agent can’t distinguish different scales of rewards (e.g., Pac-Man)
• Limited experience replay
– Might throw away important experiences
• High computational complexity
– Almost 10 days to train one game on a single GPU! Slower on physical robots
– 10+ GB to store experiences
Andrej Karpathy’s blog
Beyond DQN
• More stable learning
– Double DQN (van Hasselt, H. et al. (2015)): uses two Q-networks, one to select actions, the other to evaluate them
• Limited experience replay
– Prioritized Experience Replay (Schaul, T. et al. (2016)): weight experiences according to surprise
• High computational time complexity
– Parallel/distributed computing (Nair, A. et al. (2015))
– Dueling network (Wang, Z. et al. (2015)): splits the DQN into two streams
– Asynchronous RL (A3C) (Mnih, V. et al. (2016)): can be trained on CPU
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
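The difference between the standard DQN target and the Double DQN target can be sketched in plain Python (hypothetical helper names; Q values are given as plain lists indexed by action):

```python
def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates
    the next action via a single max."""
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the greedy action,
    the target network evaluates it."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]
```

With r = 0, γ = 1, online values [1.0, 0.0], and target values [0.0, 5.0], standard DQN takes the max of the target values (5.0), while Double DQN evaluates the online network's greedy action 0 under the target network (0.0), illustrating how decoupling selection from evaluation curbs overestimation.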
Beyond DQN
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
Beyond DQN
• Deep Policy Network for continuous control
– Simulated robots
– Physical robots
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
Beyond DQN
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., et al. (2016). Mastering the game of Go with deep neural networks and tree search.
So … DQN is not magic
• Q-learning + a CNN as function approximator
• Experience replay + a separate target network + reward clipping = stabilized learning
• To be continued …