
Deep Reinforcement Learning

Jeremy Morton

November 29, 2017



Reinforcement Learning Challenges

1. Balancing Exploration with Exploitation

2. Credit Assignment

3. Generalizing from Experience

How can we generalize effectively in large (and possibly continuous) state and action spaces?

[Figures: driving from images; a continuous action space]


Global Approximation

Recall global approximation: Q(s, a) = θ_aᵀ β(s)

What is β(s)? If β(s) = s, then this may not be a good approximation of Q(s, a).

But how can we learn appropriate features and weights so that this approximation is accurate?
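As a concrete illustration, a minimal NumPy sketch of this global approximation; the feature map β(s), the state size, and the number of actions are hypothetical placeholders, not from the slides:

```python
import numpy as np

def beta(s):
    """Hypothetical feature map: a bias term, the raw state, and one interaction term."""
    return np.array([1.0, s[0], s[1], s[0] * s[1]])

n_actions = 3
theta = np.zeros((n_actions, 4))   # one weight vector theta_a per action

def Q(s, a):
    """Global approximation Q(s, a) = theta_a^T beta(s)."""
    return theta[a] @ beta(s)

print(Q(np.array([0.5, -1.0]), a=2))   # 0.0 until the weights are learned
```

Deep RL replaces the hand-designed β(s) with features learned by a neural network.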


Neural Networks: Overview

Here, we have a supervised learning problem and would like to approximate y.

A single neuron computes a weighted sum of its inputs, i1 = w1ᵀx + b1, and passes it through a nonlinearity to produce its output, o1 = σ(w1ᵀx + b1).

Neural networks transform inputs via a set of matrix multiplications and element-wise nonlinearities. For a network with two hidden layers:

y = W3 σ(W2 σ(W1x + b1) + b2) + b3
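A minimal NumPy sketch of this forward pass; the layer sizes are arbitrary:

```python
import numpy as np

def sigma(z):
    """Element-wise sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Arbitrary layer sizes: 4 inputs, hidden layers of 8 and 6 units, 1 output.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(6, 8)), np.zeros(6)
W3, b3 = rng.normal(size=(1, 6)), np.zeros(1)

x = rng.normal(size=4)
y = W3 @ sigma(W2 @ sigma(W1 @ x + b1) + b2) + b3   # y = W3 σ(W2 σ(W1 x + b1) + b2) + b3
print(y)
```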


Neural Network Properties

Neural networks are universal function approximators.

Recall the properties of the sigmoid function: we can control its sharpness and shift it.

[Plots of σ(x), σ(1000x), and σ(1000x − 5000) for x ∈ [−10, 10]: scaling the argument sharpens the sigmoid into a step, and shifting it moves the step.]


Neural Network Properties

Neural networks are universal function approximators.

Imagine we wanted to approximate f(x) = x:

[Plot: the target function f(x) = x on x ∈ [0, 3].]


Neural Network Properties

Neural networks are universal function approximators.

Imagine you have a neural network with three neurons that take in x and output the following:

[Plots of the three neuron outputs on x ∈ [0, 3]: σ(1000x − 1000), −σ(1000x − 2000), and 2σ(1000x − 2000).]


Neural Network Properties

Neural networks are universal function approximators.

Now sum the neuron outputs together:

[Plot: the summed neuron outputs form a staircase approximation of f(x) = x on x ∈ [0, 3].]


Neural Network Properties

Neural networks are universal function approximators.

As more neurons are added, the approximation to f(x) improves.

[Plot: a finer approximation of f(x) = x on x ∈ [0, 3] obtained with more neurons.]

Good explanation here.
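A minimal NumPy sketch of this construction: the three neuron outputs from the previous slide are summed into a staircase that coarsely tracks f(x) = x (a numerically stable sigmoid is used because of the very sharp arguments):

```python
import numpy as np

def sigma(z):
    """Numerically stable sigmoid, safe for the very sharp arguments used here."""
    ez = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + ez), ez / (1.0 + ez))

x = np.linspace(0, 3, 301)
# Sum of the three neuron outputs shown on the previous slide.
approx = sigma(1000 * x - 1000) - sigma(1000 * x - 2000) + 2 * sigma(1000 * x - 2000)
print(np.max(np.abs(approx - x)))   # worst-case error of this coarse staircase
```

Adding more neurons adds more, finer steps, which is what drives the error down.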


Neural Network Properties

Neural networks can be trained efficiently through backpropagation.

Example for network with single hidden layer:

y = W2 σ(W1x + b1) + b2

If loss function L is a function of y, then by chain rule:

∂L/∂W2 = ∂L/∂y (∂y/∂W2)ᵀ = ∂L/∂y σ(W1x + b1)ᵀ

Finally, can update W2 using gradient descent:

W2 ← W2 − α ∂L/∂W2
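A minimal NumPy sketch of this update, assuming a squared-error loss L = ½‖y − t‖² so that ∂L/∂y = y − t; the target t and the layer sizes are illustrative:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
x, t = rng.normal(size=3), rng.normal(size=2)   # input and illustrative target
alpha = 0.1                                      # learning rate

h = sigma(W1 @ x + b1)        # hidden activations σ(W1 x + b1)
y = W2 @ h + b2               # network output
dL_dy = y - t                 # ∂L/∂y for L = ½‖y − t‖²
dL_dW2 = np.outer(dL_dy, h)   # ∂L/∂W2 = ∂L/∂y · σ(W1 x + b1)ᵀ
W2 -= alpha * dL_dW2          # gradient-descent step W2 ← W2 − α ∂L/∂W2
```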


Deep Q-Learning

V. Mnih, et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at https://arxiv.org/abs/1312.5602


Deep Q-Learning: Objective

Recall the Bellman equation:

Q∗(s, a) = E_{s′∼E}[ R(s, a) + γ max_{a′} Q∗(s′, a′) ]

We estimate the value of Q∗(s, a) using a neural network with parameters θ. At iteration i, the loss function is given by the temporal difference error:

L_i(θ_i) = E_{s,a∼ρ(·); s′∼E}[ ( R(s, a) + γ max_{a′} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i) )² ]

where θ_{i−1} are the parameter values from the previous iteration.
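A minimal sketch of this loss for a single transition; `q_network` is a placeholder standing in for the neural network, not an actual library call:

```python
import numpy as np

def q_network(state, params):
    """Placeholder Q-network: returns one Q-value per action (here just a linear map)."""
    return params @ state

gamma = 0.99
params_prev = np.zeros((4, 8))   # θ_{i-1}, held fixed when forming the target
params = np.zeros((4, 8))        # θ_i, the parameters being trained

s, a, r, s_next = np.ones(8), 2, 1.0, np.ones(8)   # one illustrative transition
target = r + gamma * np.max(q_network(s_next, params_prev))   # R(s,a) + γ max_a' Q(s',a'; θ_{i-1})
loss = (target - q_network(s, params)[a]) ** 2                # squared TD error
print(loss)
```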


Deep Q-Learning: Challenges

- Often in supervised learning it is assumed that successive samples are i.i.d.
- However, in reinforcement learning successive samples are highly correlated.
- To combat this, transitions were stored in a replay memory D (a minimal sketch follows below).
- During training, random transitions {s_t, a_t, r_t, s_{t+1}} were sampled from D and used to train the network.
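A minimal sketch of such a replay memory in Python; the class and method names are illustrative, not taken from the paper:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between successive transitions.
        return random.sample(self.buffer, batch_size)
```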


Deep Q-Learning: Challenges

When performing regression in supervised learning, it is common to use the least-squares loss:

L = ½ (ŷ − y)²

where ŷ is the target value and y is the model prediction. In the case of Q-Learning, this target value is:

ŷ = r_t + γ max_{a′} Q(s_{t+1}, a′; θ).

Because of this, the network's target value changes as the network is trained. The deep Q-Learning algorithm uses a target network, whose parameters are changed less frequently, in order to keep the target value stationary.
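A minimal sketch of the target-network idea, treating the parameters as a single NumPy array; the refresh interval C and the omitted gradient step are illustrative:

```python
import numpy as np

theta = np.random.default_rng(0).normal(size=(4, 8))   # online network parameters
theta_target = theta.copy()                            # target network parameters
C = 1000                                               # illustrative refresh interval

for step in range(5000):
    # ... gradient update of theta, with targets computed from theta_target ...
    if step % C == 0:
        theta_target = theta.copy()   # refresh the target network infrequently
```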


Deep Q-Learning: Challenges

- Input images are 210 × 160 pixels with three color channels, so the inputs and state space are huge → images are downsampled, converted to grayscale, and cropped to 84 × 84 pixels.
- Single images are static, so many states are perceptually aliased → the input to the network is a sequence of 4 preprocessed frames (a rough sketch follows below).
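A rough NumPy sketch of this preprocessing, using channel averaging and slicing in place of a proper image-resizing library; the exact resizing and cropping used in the paper differ:

```python
import numpy as np

def preprocess(frame):
    """frame: 210 x 160 x 3 uint8 Atari image -> 84 x 84 grayscale array (crude stand-in)."""
    gray = frame.mean(axis=2)        # naive grayscale conversion
    small = gray[::2, ::2]           # crude 2x downsample -> 105 x 80
    padded = np.zeros((105, 84))
    padded[:, :80] = small           # pad columns to 84 (the paper resizes instead)
    return padded[10:94, :]          # crop rows to 84 -> 84 x 84

frames = [preprocess(np.zeros((210, 160, 3), dtype=np.uint8)) for _ in range(4)]
state = np.stack(frames)             # network input: a stack of 4 preprocessed frames
print(state.shape)                   # (4, 84, 84)
```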


Deep Q-Learning: Network Architecture


Deep Q-Learning: Algorithm


Deep Q-Learning: Results


Policy Gradient Methods

- In deep Q-learning, a neural network is used to approximate Q∗(s, a).
- In policy gradient methods, neural networks are instead used to represent a policy πθ(a | s).
- Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
- The parameters θ are optimized to maximize the expected future reward E_{a∼πθ}[Q^{πθ}(s, a)].
- Policy gradient methods can be applied to both discrete and continuous action spaces.


Calculating Gradients

How do we find gradients for the policy parameters?

∇θ E_{a∼πθ}[Q^{πθ}(s, a)] = ∇θ ∫ πθ(a | s) Q^{πθ}(s, a) da
  = ∫ ∇θ πθ(a | s) Q^{πθ}(s, a) da
  = ∫ (∇θ πθ(a | s) / πθ(a | s)) πθ(a | s) Q^{πθ}(s, a) da
  = ∫ ∇θ log πθ(a | s) πθ(a | s) Q^{πθ}(s, a) da
  = E_{a∼πθ}[∇θ log πθ(a | s) Q^{πθ}(s, a)]


Policy Gradients: Estimating Q

Given a sample trajectory {s_1, a_1, r_1, . . . , s_{T−1}, a_{T−1}, r_T}, we can create an empirical estimate of Q:

Q^{πθ}(s_t, a_t) = Σ_{k=t}^{T} γ^{k−1} r_k

Then perform the following parameter update at each time step:

θ ← θ + α ∇θ log πθ(a_t | s_t) Q^{πθ}(s_t, a_t).

This is known as the REINFORCE algorithm. It is also possible to use a second neural network to estimate Q; this is known as an actor-critic algorithm.
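A minimal sketch of a REINFORCE update for a discrete-action softmax policy, for which ∇θ log πθ(a | s) has the closed form (onehot(a) − πθ(· | s)) sᵀ; the environment interaction is omitted and the trajectory below is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))   # policy parameters
alpha, gamma = 0.01, 0.99

def policy(s):
    """Softmax policy pi_theta(a | s) over discrete actions."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Illustrative trajectory of (state, action, reward) tuples.
trajectory = [(rng.normal(size=n_features), rng.integers(n_actions), rng.normal())
              for _ in range(5)]

T = len(trajectory)
for t, (s, a, r) in enumerate(trajectory):
    # Empirical return from time t (the slide's sum of gamma^(k-1) r_k, written 0-indexed here).
    q_hat = sum(gamma ** k * trajectory[k][2] for k in range(t, T))
    # grad of log pi_theta(a | s) for a softmax policy: (onehot(a) - pi(.|s)) s^T
    grad_log_pi = np.outer(np.eye(n_actions)[a] - policy(s), s)
    theta += alpha * grad_log_pi * q_hat     # REINFORCE ascent step
```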


Policy Gradients: Demonstration

Full video here.


Further Reading

V. Mnih, et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. Available at https://arxiv.org/abs/1602.01783.

[Figure: performance of the Asynchronous Advantage Actor-Critic (A3C) algorithm. Credit: Medium post by A. Juliani here.]


Further Reading

M. Hessel, et al. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. Available at https://arxiv.org/abs/1710.02298.


Further Reading

- J. Schulman, et al. (2015). Trust Region Policy Optimization. Available at http://arxiv.org/abs/1502.05477.
- D. Silver, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484-489. (AlphaGo paper)
- D. Silver, et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), pp. 354-359. (AlphaGo Zero paper)
- Y. Li (2017). Deep Reinforcement Learning: An Overview. Available at https://arxiv.org/abs/1701.07274v5. (Long survey)
- K. Arulkumaran, et al. (2017). A Brief Survey of Deep Reinforcement Learning. Available at https://arxiv.org/abs/1708.05866v2. (Shorter survey)


Deep Learning Frameworks

Currently, the (arguably) two most popular frameworks are:

- TensorFlow (https://www.tensorflow.org)
- PyTorch (http://pytorch.org)

With both, you can construct and train networks in Python. Other frameworks include Theano, CNTK, Caffe, MXNet, etc.

For easy network construction and training, check out Keras.
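For example, a small fully connected network can be constructed and made ready for training in a few lines of PyTorch (sizes are arbitrary); Keras offers a similarly compact Sequential API:

```python
import torch
import torch.nn as nn

# A small fully connected Q-network: state features in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

q_values = q_net(torch.randn(1, 8))   # forward pass on a batch of one state
print(q_values.shape)                 # torch.Size([1, 4])
```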

A much more in-depth discussion of deep learning frameworks can be found in the CS 231n course slides.


Additional Resources

Neural Networks:

- CS 231n Course Notes (http://cs231n.stanford.edu)
- Michael Nielsen's free textbook Neural Networks and Deep Learning (http://neuralnetworksanddeeplearning.com)
- Deep Learning by Goodfellow, Bengio, and Courville (http://www.deeplearningbook.org)

Deep Reinforcement Learning:

- CS 294 Course Notes (UC Berkeley) (http://rll.berkeley.edu/deeprlcourse/)
- David Silver's RL Course Notes (http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html)
- OpenAI Gym (https://gym.openai.com/docs/)


Thank You

Questions?
