
Deep Reinforcement Learning

Jeremy Morton

November 29, 2017



Reinforcement Learning Challenges

1. Balancing Exploration with Exploitation

2. Credit Assignment

3. Generalizing from Experience

How can we generalize effectively in large (and possibly continuous) state and action spaces?

[Figures: driving from images; a continuous action space]


Global Approximation

Recall global approximation: Q(s, a) = θ_aᵀ β(s)

What is β(s)? If β(s) = s, then this may not be a good approximation of Q(s, a).

But how can we learn appropriate features and weights so that this approximation is accurate?
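As a concrete illustration, a minimal NumPy sketch of this global approximation; the feature map β(s), the state size, and the number of actions are hypothetical placeholders, not from the slides:

```python
import numpy as np

def beta(s):
    """Hypothetical feature map: a bias term, the raw state, and one interaction term."""
    return np.array([1.0, s[0], s[1], s[0] * s[1]])

n_actions = 3
theta = np.zeros((n_actions, 4))   # one weight vector theta_a per action

def Q(s, a):
    """Global approximation Q(s, a) = theta_a^T beta(s)."""
    return theta[a] @ beta(s)

print(Q(np.array([0.5, -1.0]), a=2))   # 0.0 until the weights are learned
```

Deep RL replaces the hand-designed β(s) with features learned by a neural network.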


Neural Networks: Overview

Here, we have a supervised learning problem and would like to approximate y.

A single neuron computes a weighted sum of its inputs, i1 = w1ᵀx + b1, and passes it through a nonlinearity to produce its output, o1 = σ(w1ᵀx + b1).

Neural networks transform inputs via a set of matrix multiplications and element-wise nonlinearities. For a network with two hidden layers:

y = W3 σ(W2 σ(W1x + b1) + b2) + b3
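A minimal NumPy sketch of this forward pass; the layer sizes are arbitrary:

```python
import numpy as np

def sigma(z):
    """Element-wise sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Arbitrary layer sizes: 4 inputs, hidden layers of 8 and 6 units, 1 output.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(6, 8)), np.zeros(6)
W3, b3 = rng.normal(size=(1, 6)), np.zeros(1)

x = rng.normal(size=4)
y = W3 @ sigma(W2 @ sigma(W1 @ x + b1) + b2) + b3   # y = W3 σ(W2 σ(W1 x + b1) + b2) + b3
print(y)
```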


Neural Network Properties

Neural networks are universal function approximators.

Recall the properties of the sigmoid function: we can control its sharpness and shift it.

[Plots of σ(x), σ(1000x), and σ(1000x − 5000) for x ∈ [−10, 10]: scaling the argument sharpens the sigmoid into a step, and shifting it moves the step.]


Neural Network Properties

Neural networks are universal function approximators.

Imagine we wanted to approximate f(x) = x:

[Plot: the target function f(x) = x on x ∈ [0, 3].]


Neural Network Properties

Neural networks are universal function approximators.

Imagine you have a neural network with three neurons that take in x and output the following:

[Plots of the three neuron outputs on x ∈ [0, 3]: σ(1000x − 1000), −σ(1000x − 2000), and 2σ(1000x − 2000).]


Neural Network Properties

Neural networks are universal function approximators.

Now sum the neuron outputs together:

[Plot: the summed neuron outputs form a staircase approximation of f(x) = x on x ∈ [0, 3].]


Neural Network Properties

Neural networks are universal function approximators.

As more neurons are added, the approximation to f(x) improves.

[Plot: a finer approximation of f(x) = x on x ∈ [0, 3] obtained with more neurons.]

Good explanation here.
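A minimal NumPy sketch of this construction: the three neuron outputs from the previous slide are summed into a staircase that coarsely tracks f(x) = x (a numerically stable sigmoid is used because of the very sharp arguments):

```python
import numpy as np

def sigma(z):
    """Numerically stable sigmoid, safe for the very sharp arguments used here."""
    ez = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + ez), ez / (1.0 + ez))

x = np.linspace(0, 3, 301)
# Sum of the three neuron outputs shown on the previous slide.
approx = sigma(1000 * x - 1000) - sigma(1000 * x - 2000) + 2 * sigma(1000 * x - 2000)
print(np.max(np.abs(approx - x)))   # worst-case error of this coarse staircase
```

Adding more neurons adds more, finer steps, which is what drives the error down.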


Neural Network Properties

Neural networks can be trained efficiently through backpropagation.

Example for network with single hidden layer:

y = W2 σ(W1x + b1) + b2

If loss function L is a function of y, then by chain rule:

∂L/∂W2 = ∂L/∂y (∂y/∂W2)ᵀ = ∂L/∂y σ(W1x + b1)ᵀ

Finally, can update W2 using gradient descent:

W2 ← W2 − α ∂L/∂W2
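A minimal NumPy sketch of this update, assuming a squared-error loss L = ½‖y − t‖² so that ∂L/∂y = y − t; the target t and the layer sizes are illustrative:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
x, t = rng.normal(size=3), rng.normal(size=2)   # input and illustrative target
alpha = 0.1                                      # learning rate

h = sigma(W1 @ x + b1)        # hidden activations σ(W1 x + b1)
y = W2 @ h + b2               # network output
dL_dy = y - t                 # ∂L/∂y for L = ½‖y − t‖²
dL_dW2 = np.outer(dL_dy, h)   # ∂L/∂W2 = ∂L/∂y · σ(W1 x + b1)ᵀ
W2 -= alpha * dL_dW2          # gradient-descent step W2 ← W2 − α ∂L/∂W2
```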


Deep Q-Learning

V. Mnih, et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at https://arxiv.org/abs/1312.5602


Deep Q-Learning: Objective

Recall the Bellman equation:

Q∗(s, a) = E_{s′∼E}[ R(s, a) + γ max_{a′} Q∗(s′, a′) ]

We estimate the value of Q∗(s, a) using a neural network with parameters θ. At iteration i, the loss function is given by the temporal difference error:

L_i(θ_i) = E_{s,a∼ρ(·); s′∼E}[ ( R(s, a) + γ max_{a′} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i) )² ]

where θ_{i−1} are the parameter values from the previous iteration.
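A minimal sketch of this loss for a single transition; `q_network` is a placeholder standing in for the neural network, not an actual library call:

```python
import numpy as np

def q_network(state, params):
    """Placeholder Q-network: returns one Q-value per action (here just a linear map)."""
    return params @ state

gamma = 0.99
params_prev = np.zeros((4, 8))   # θ_{i-1}, held fixed when forming the target
params = np.zeros((4, 8))        # θ_i, the parameters being trained

s, a, r, s_next = np.ones(8), 2, 1.0, np.ones(8)   # one illustrative transition
target = r + gamma * np.max(q_network(s_next, params_prev))   # R(s,a) + γ max_a' Q(s',a'; θ_{i-1})
loss = (target - q_network(s, params)[a]) ** 2                # squared TD error
print(loss)
```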


Deep Q-Learning: Challenges

- Often in supervised learning it is assumed that successive samples are i.i.d.
- However, in reinforcement learning successive samples are highly correlated.
- To combat this, transitions were stored in a replay memory D (a minimal sketch follows below).
- During training, random transitions {s_t, a_t, r_t, s_{t+1}} were sampled from D and used to train the network.
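A minimal sketch of such a replay memory in Python; the class and method names are illustrative, not taken from the paper:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between successive transitions.
        return random.sample(self.buffer, batch_size)
```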


Deep Q-Learning: Challenges

When performing regression in supervised learning, it is common to use the least-squares loss:

L = ½ (ŷ − y)²

where ŷ is the target value and y is the model prediction. In the case of Q-Learning, this target value is:

ŷ = r_t + γ max_{a′} Q(s_{t+1}, a′; θ).

Because of this, the network's target value changes as the network is trained. The deep Q-Learning algorithm uses a target network, whose parameters are changed less frequently, in order to keep the target value stationary.
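A minimal sketch of the target-network idea, treating the parameters as a single NumPy array; the refresh interval C and the omitted gradient step are illustrative:

```python
import numpy as np

theta = np.random.default_rng(0).normal(size=(4, 8))   # online network parameters
theta_target = theta.copy()                            # target network parameters
C = 1000                                               # illustrative refresh interval

for step in range(5000):
    # ... gradient update of theta, with targets computed from theta_target ...
    if step % C == 0:
        theta_target = theta.copy()   # refresh the target network infrequently
```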


Deep Q-Learning: Challenges

- Input images are 210 × 160 pixels with three color channels, so the inputs and state space are huge → images are downsampled, converted to grayscale, and cropped to 84 × 84 pixels.
- Single images are static, so many states are perceptually aliased → the input to the network is a sequence of 4 preprocessed frames (a rough sketch follows below).
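A rough NumPy sketch of this preprocessing, using channel averaging and slicing in place of a proper image-resizing library; the exact resizing and cropping used in the paper differ:

```python
import numpy as np

def preprocess(frame):
    """frame: 210 x 160 x 3 uint8 Atari image -> 84 x 84 grayscale array (crude stand-in)."""
    gray = frame.mean(axis=2)        # naive grayscale conversion
    small = gray[::2, ::2]           # crude 2x downsample -> 105 x 80
    padded = np.zeros((105, 84))
    padded[:, :80] = small           # pad columns to 84 (the paper resizes instead)
    return padded[10:94, :]          # crop rows to 84 -> 84 x 84

frames = [preprocess(np.zeros((210, 160, 3), dtype=np.uint8)) for _ in range(4)]
state = np.stack(frames)             # network input: a stack of 4 preprocessed frames
print(state.shape)                   # (4, 84, 84)
```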


Deep Q-Learning: Network Architecture


Deep Q-Learning: Algorithm


Deep Q-Learning: Results


Policy Gradient Methods

- In deep Q-learning, a neural network is used to approximate Q∗(s, a).
- In policy gradient methods, neural networks are instead used to represent a policy πθ(a | s).
- Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
- The parameters θ are optimized to maximize the expected future reward E_{a∼πθ}[Q^{πθ}(s, a)].
- Policy gradient methods can be applied to both discrete and continuous action spaces.


Calculating Gradients

How do we find gradients for the policy parameters?

∇θ E_{a∼πθ}[Q^{πθ}(s, a)] = ∇θ ∫ πθ(a | s) Q^{πθ}(s, a) da
  = ∫ ∇θ πθ(a | s) Q^{πθ}(s, a) da
  = ∫ (∇θ πθ(a | s) / πθ(a | s)) πθ(a | s) Q^{πθ}(s, a) da
  = ∫ ∇θ log πθ(a | s) πθ(a | s) Q^{πθ}(s, a) da
  = E_{a∼πθ}[∇θ log πθ(a | s) Q^{πθ}(s, a)]


Policy Gradients: Estimating Q

Given a sample trajectory {s_1, a_1, r_1, . . . , s_{T−1}, a_{T−1}, r_T}, we can create an empirical estimate of Q:

Q^{πθ}(s_t, a_t) = Σ_{k=t}^{T} γ^{k−1} r_k

Then perform the following parameter update at each time step:

θ ← θ + α ∇θ log πθ(a_t | s_t) Q^{πθ}(s_t, a_t).

This is known as the REINFORCE algorithm. It is also possible to use a second neural network to estimate Q; this is known as an actor-critic algorithm.
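A minimal sketch of a REINFORCE update for a discrete-action softmax policy, for which ∇θ log πθ(a | s) has the closed form (onehot(a) − πθ(· | s)) sᵀ; the environment interaction is omitted and the trajectory below is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))   # policy parameters
alpha, gamma = 0.01, 0.99

def policy(s):
    """Softmax policy pi_theta(a | s) over discrete actions."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Illustrative trajectory of (state, action, reward) tuples.
trajectory = [(rng.normal(size=n_features), rng.integers(n_actions), rng.normal())
              for _ in range(5)]

T = len(trajectory)
for t, (s, a, r) in enumerate(trajectory):
    # Empirical return from time t (the slide's sum of gamma^(k-1) r_k, written 0-indexed here).
    q_hat = sum(gamma ** k * trajectory[k][2] for k in range(t, T))
    # grad of log pi_theta(a | s) for a softmax policy: (onehot(a) - pi(.|s)) s^T
    grad_log_pi = np.outer(np.eye(n_actions)[a] - policy(s), s)
    theta += alpha * grad_log_pi * q_hat     # REINFORCE ascent step
```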


Policy Gradients: Demonstration

Full video here.


Further Reading

V. Mnih, et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. Available at https://arxiv.org/abs/1602.01783.

[Figure: performance of the Asynchronous Advantage Actor-Critic (A3C) algorithm. Credit: Medium post by A. Juliani here.]


Further Reading

M. Hessel, et al. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. Available at https://arxiv.org/abs/1710.02298.


Further Reading

- J. Schulman, et al. (2015). Trust Region Policy Optimization. Available at http://arxiv.org/abs/1502.05477.
- D. Silver, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484-489. (AlphaGo paper)
- D. Silver, et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), pp. 354-359. (AlphaGo Zero paper)
- Y. Li (2017). Deep Reinforcement Learning: An Overview. Available at https://arxiv.org/abs/1701.07274v5. (Long survey)
- K. Arulkumaran, et al. (2017). A Brief Survey of Deep Reinforcement Learning. Available at https://arxiv.org/abs/1708.05866v2. (Shorter survey)


Deep Learning Frameworks

Currently, the (arguably) two most popular frameworks are:

- TensorFlow (https://www.tensorflow.org)
- PyTorch (http://pytorch.org)

With both, you can construct and train networks in Python. Other frameworks include Theano, CNTK, Caffe, MXNet, etc.

For easy network construction and training, check out Keras.
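For example, a small fully connected network can be constructed and made ready for training in a few lines of PyTorch (sizes are arbitrary); Keras offers a similarly compact Sequential API:

```python
import torch
import torch.nn as nn

# A small fully connected Q-network: state features in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

q_values = q_net(torch.randn(1, 8))   # forward pass on a batch of one state
print(q_values.shape)                 # torch.Size([1, 4])
```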

A much more in-depth discussion of deep learning frameworks can be found in the CS 231n course slides.


Additional Resources

Neural Networks:

- CS 231n Course Notes (http://cs231n.stanford.edu)
- Michael Nielsen's free textbook Neural Networks and Deep Learning (http://neuralnetworksanddeeplearning.com)
- Deep Learning by Goodfellow, Bengio, and Courville (http://www.deeplearningbook.org)

Deep Reinforcement Learning:

- CS 294 Course Notes (UC Berkeley) (http://rll.berkeley.edu/deeprlcourse/)
- David Silver's RL Course Notes (http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html)
- OpenAI Gym (https://gym.openai.com/docs/)


Thank You

Questions?
