Special Topics in Deep Learning (COMP 6211D). Monte Carlo tree search: D. Silver, A. Huang, et al.
TRANSCRIPT
Reinforcement learning provides a formalism for behavior:
decisions (actions) → consequences (observations, rewards)
Mnih et al. '13; Schulman et al. '14 & '15; Levine*, Finn*, et al. '16
What is deep RL, and why should we care?
standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)  [Felzenszwalb '08]
deep learning: end-to-end training
standard reinforcement learning: features → more features → linear policy or value func. → action
deep reinforcement learning: end-to-end training → action
Example: robotics
robotic control pipeline: observations → state estimation (e.g. vision) → modeling & prediction → planning → low-level control → controls
The reinforcement learning problem is the AI problem!
decisions (actions) → consequences (observations, rewards)
Actions: muscle contractions. Observations: sight, smell. Rewards: food.
Actions: motor current or torque. Observations: camera images. Rewards: task success measure (e.g., running speed).
Actions: what to purchase. Observations: inventory levels. Rewards: profit.
Deep models are what allow reinforcement learning algorithms to solve complex problems end to end!
Why should we study this now?
1. Advances in deep learning
2. Advances in reinforcement learning
3. Advances in computational capability
Why should we study this now?
L.-J. Lin, “Reinforcement learning for robots using neural networks.” 1993
Tesauro, 1995
Why should we study this now?
Atari games: Q-learning: V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al. "Playing Atari with Deep Reinforcement Learning". (2013).
Policy gradients: J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". (2015). V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning". (2016).
Real-world robots: Guided policy search: S. Levine*, C. Finn*, T. Darrell, P. Abbeel. "End-to-end training of deep visuomotor policies". (2015).
Q-learning: D. Kalashnikov et al. "QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation". (2018).
Beating Go champions: Supervised learning + policy gradients + value functions + Monte Carlo tree search: D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, et al. "Mastering the game of Go with deep neural networks and tree search". Nature (2016).
Beyond learning from reward
• Basic reinforcement learning deals with maximizing rewards
• This is not the only problem that matters for sequential decision making!
• We will cover more advanced topics:
  • Learning reward functions from example (inverse reinforcement learning)
  • Transferring knowledge between domains (transfer learning, meta-learning)
  • Learning to predict and using prediction to act
Are there other forms of supervision?
• Learning from demonstrations
  • Directly copying observed behavior
  • Inferring rewards from observed behavior (inverse reinforcement learning)
• Learning from observing the world
  • Learning to predict
  • Unsupervised learning
• Learning from other tasks
  • Transfer learning
  • Meta-learning: learning to learn
Playing games with predictive models
Kaiser et al. 2019
(figure: real vs. predicted game frames)
But sometimes there are issues…
6.S191 Introduction to Deep Learning | introtodeeplearning.com | 1/30/19
Reinforcement Learning (RL): Key Concepts
AGENT ↔ ENVIRONMENT: the agent sends actions, the environment returns observations and rewards
Action: a_t
State changes: s_{t+1}
Reward: r_t
γ: discount factor
Total reward: R_t = Σ_i r_i
Discounted total reward: R_t = Σ_i γ^i r_i = γ^t r_t + γ^{t+1} r_{t+1} + ... + γ^{t+n} r_{t+n} + ...
Defining the Q-function
!" = $" + &$"'( +&)$"') + ⋯
+ ,, . = / !"
Total reward, !" , is the discounted sum of all rewards obtained from time 0
The Q-function captures the expected total future reward an agent in state, ,, can receive by executing a certain action, .
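The discounted sum above can be sketched in a few lines of plain Python (the reward list and γ here are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    given the list of rewards collected from time t onward."""
    R = 0.0
    # Fold backwards: each step applies one more factor of gamma.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 1, 1] with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Iterating backwards avoids computing powers of γ explicitly.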
How to take actions given a Q-function?
! ", $ = & '((state, action)
Ultimately, the agent needs a policy ) * , to infer the best action to take at its state, s
Strategy: the policy should choose an action that maximizes future reward
+∗ " = argmax2
!(", $)
Deep Reinforcement Learning Algorithms
Value Learning: find Q(s, a); act via a = argmax_a Q(s, a)
Policy Learning: find π(s); act via sampling a ~ π(s)
Digging deeper into the Q-function
Example: Atari Breakout
It can be very difficult for humans to accurately estimate Q-values
A B
Which (s,a) pair has a higher Q-value?
Digging deeper into the Q-function
Example: Atari Breakout - Middle
Which (s, a) pair has a higher Q-value: A or B?
Digging deeper into the Q-function
Example: Atari Breakout - Side
Which (s, a) pair has a higher Q-value: A or B?
Deep Q Networks (DQN)
How can we use deep neural networks to model Q-functions?
Option 1: input state s and action a (e.g., "move right") → Deep NN → output Q(s, a)
Deep Q Networks (DQN)
Option 2: input state s → Deep NN → outputs Q(s, a_1), Q(s, a_2), Q(s, a_3), one per action
Deep Q Networks (DQN): Training
Network: state s → Deep NN → Q(s, a_1), Q(s, a_2), Q(s, a_3)
Loss: ℒ = E[ ‖ (r + γ max_{a'} Q(s', a')) − Q(s, a) ‖² ]
    target: r + γ max_{a'} Q(s', a')    predicted: Q(s, a)
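The same Bellman target r + γ max_{a'} Q(s', a') drives tabular Q-learning; a minimal tabular sketch of one update step (DQN replaces the table with a neural network trained on the squared error; the transition and learning rate below are made up for illustration):

```python
def q_update(Q, s, a, r, s_next, actions, gamma=0.99, lr=0.1):
    """One Q-learning step: move Q(s, a) a fraction lr toward the
    Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    predicted = Q.get((s, a), 0.0)
    Q[(s, a)] = predicted + lr * (target - predicted)
    return Q[(s, a)]

Q = {}           # table starts empty; missing entries read as 0
actions = [0, 1]
# Hypothetical transition (s=0, a=1, r=1.0, s'=1): target = 1.0,
# so the entry moves 10% of the way from 0 toward 1.0.
q_update(Q, 0, 1, 1.0, 1, actions)
print(Q[(0, 1)])  # 0.1
```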
DQN Atari Results
(figure: DQN score relative to human level per Atari game; many games surpass human-level, some remain below human-level)
Downsides of Q-learning
Complexity:
• Can model scenarios where the action space is discrete and small
• Cannot handle continuous action spaces
Flexibility:
• Cannot learn stochastic policies, since the policy is computed deterministically from the Q-function
To overcome these, consider a new class of RL training algorithms: policy gradient methods
IMPORTANT: Imagine you want to predict the steering wheel angle of a car!
Policy Gradient (PG): Key Idea
DQN (before): approximating Q and inferring the optimal policy
state s → Deep NN → Q(s, a_1), Q(s, a_2), Q(s, a_3)
Policy Gradient: Directly optimize the policy!
state s → Deep NN → π(a_1|s), π(a_2|s), π(a_3|s)
The outputs form a probability distribution: Σ_{a_i ∈ A} π(a_i|s) = 1
π(a|s) = P(taking action a | observed state s)
Policy Gradient (PG): Training
1. Run a policy for a while
2. Increase probability of actions that lead to high rewards
3. Decrease probability of actions that lead to low/no rewards

function REINFORCE
    initialize θ
    for episode ~ π_θ:
        {s_t, a_t, r_t} ← episode
        for t = 1 to T-1:
            ∇ ← ∇_θ log π_θ(a_t|s_t) R_t
            θ ← θ + α ∇
    return θ
∇_θ log π_θ(a_t|s_t) R_t : (log-likelihood of action) × (reward)
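The REINFORCE loop can be sketched on the simplest possible environment, a two-armed bandit with a softmax policy over per-action preferences θ, where ∇_θ log π_θ(a) has the closed form (1[a=i] − π(a_i)). The environment, episode count, and learning rate here are illustrative, not from the slides:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def reinforce(n_episodes=2000, alpha=0.1, seed=0):
    """REINFORCE on a toy 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                        # initialize theta
    for _ in range(n_episodes):               # run the policy for a while
        probs = softmax(theta)
        a = rng.choices([0, 1], weights=probs)[0]
        r = 1.0 if a == 1 else 0.0            # reward from the environment
        for i in range(2):
            # grad of log pi(a) wrt theta_i is (1[a==i] - probs[i]); scale by r
            grad = ((1.0 if a == i else 0.0) - probs[i]) * r
            theta[i] += alpha * grad          # theta <- theta + alpha * grad
    return softmax(theta)

probs = reinforce()
print(probs)  # probability mass shifts onto the rewarded arm
```

Rewarded actions get their log-likelihood pushed up; unrewarded ones are left alone (or, with a baseline, pushed down), exactly as in steps 2 and 3 above.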
The Game of Go
Board size (n×n) | Positions 3^(n²)    | % legal | Legal positions
1×1              | 3                   | 33.33%  | 1
2×2              | 81                  | 70.37%  | 57
3×3              | 19,683              | 64.40%  | 12,675
4×4              | 43,046,721          | 56.49%  | 24,318,165
5×5              | 847,288,609,443     | 48.90%  | 414,295,148,741
9×9              | 4.434264882×10^38   | 23.44%  | 1.03919148791×10^38
13×13            | 4.300233593×10^80   | 8.66%   | 3.72497923077×10^79
19×19            | 1.740896506×10^172  | 1.20%   | 2.08168199382×10^170

There are more legal 19×19 board positions than atoms in the observable universe.
Aim: Get more board territory than your opponent.
Source: Wikipedia.
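The "Positions" column is just 3^(n²), since every point on an n×n board is empty, black, or white; a quick check of the small boards (the legal-position counts come from the table and are not recomputed here):

```python
def total_positions(n):
    """Every point on an n x n Go board is empty, black, or white: 3**(n*n)."""
    return 3 ** (n * n)

for n in [1, 2, 3, 4]:
    print(n, total_positions(n))
# 1 -> 3, 2 -> 81, 3 -> 19683, 4 -> 43046721, matching the table
```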
AlphaGo Beats Top Human Player at Go (2016)
Silver et al., Nature 2016.
1) Initial training: human data
2) Self-play and reinforcement learning → super-human performance
3) "Intuition" about board state