Special Topics in Deep Learning (COMP 6211D). Monte Carlo tree search: D. Silver, A. Huang, et al.
TRANSCRIPT
Reinforcement learning provides a formalism for behavior:
decisions (actions) → consequences (observations, rewards)
Mnih et al. '13; Schulman et al. '14 & '15; Levine*, Finn*, et al. '16
What is deep RL, and why should we care?
standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)  [Felzenszwalb '08]
deep learning: end-to-end training
standard reinforcement learning: features → more features → linear policy or value func. → action
deep reinforcement learning: end-to-end training → action
Example: robotics
robotic control pipeline: observations → state estimation (e.g. vision) → modeling & prediction → planning → low-level control → controls
The reinforcement learning problem is the AI problem!
decisions (actions) → consequences (observations, rewards)
Actions: muscle contractions. Observations: sight, smell. Rewards: food.
Actions: motor current or torque. Observations: camera images. Rewards: task success measure (e.g., running speed).
Actions: what to purchase. Observations: inventory levels. Rewards: profit.
Deep models are what allow reinforcement learning algorithms to solve complex problems end to end!
Why should we study this now?
1. Advances in deep learning
2. Advances in reinforcement learning
3. Advances in computational capability
Why should we study this now?
L.-J. Lin, “Reinforcement learning for robots using neural networks.” 1993
Tesauro, 1995
Why should we study this now?
Atari games: Q-learning: V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al. "Playing Atari with Deep Reinforcement Learning". (2013).
Policy gradients: J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". (2015). V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning". (2016).
Real-world robots: Guided policy search: S. Levine*, C. Finn*, T. Darrell, P. Abbeel. "End-to-end training of deep visuomotor policies". (2015).
Q-learning: D. Kalashnikov et al. "QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation". (2018).
Beating Go champions: Supervised learning + policy gradients + value functions + Monte Carlo tree search: D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, et al. "Mastering the game of Go with deep neural networks and tree search". Nature (2016).
Beyond learning from reward
• Basic reinforcement learning deals with maximizing rewards
• This is not the only problem that matters for sequential decision making!
• We will cover more advanced topics:
  • Learning reward functions from example (inverse reinforcement learning)
  • Transferring knowledge between domains (transfer learning, meta-learning)
  • Learning to predict and using prediction to act
Are there other forms of supervision?
• Learning from demonstrations
  • Directly copying observed behavior
  • Inferring rewards from observed behavior (inverse reinforcement learning)
• Learning from observing the world
  • Learning to predict
  • Unsupervised learning
• Learning from other tasks
  • Transfer learning
  • Meta-learning: learning to learn
Playing games with predictive models
Kaiser et al. 2019
(figure: real vs. predicted game frames)
But sometimes there are issues…
6.S191 Introduction to Deep Learning | introtodeeplearning.com | 1/30/19
Reinforcement Learning (RL): Key Concepts
AGENT ↔ ENVIRONMENT: the agent sends actions, the environment returns observations and rewards
Action: a_t
State changes: s_{t+1}
Reward: r_t
γ: discount factor
Total reward: R_t = Σ_i r_i
Discounted total reward: R_t = Σ_i γ^i r_i = γ^t r_t + γ^{t+1} r_{t+1} + ... + γ^{t+n} r_{t+n} + ...
Defining the Q-function
!" = $" + &$"'( +&)$"') + ⋯
+ ,, . = / !"
Total reward, !" , is the discounted sum of all rewards obtained from time 0
The Q-function captures the expected total future reward an agent in state, ,, can receive by executing a certain action, .
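The discounted sum above can be sketched in a few lines of plain Python (the reward list and γ here are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    given the list of rewards collected from time t onward."""
    R = 0.0
    # Fold backwards: each step applies one more factor of gamma.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 1, 1] with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Iterating backwards avoids computing powers of γ explicitly.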
How to take actions given a Q-function?
! ", $ = & '((state, action)
Ultimately, the agent needs a policy ) * , to infer the best action to take at its state, s
Strategy: the policy should choose an action that maximizes future reward
+∗ " = argmax2
!(", $)
Deep Reinforcement Learning Algorithms
Value Learning: find Q(s, a); act via a = argmax_a Q(s, a)
Policy Learning: find π(s); act via sampling a ~ π(s)
Digging deeper into the Q-function
Example: Atari Breakout
It can be very difficult for humans to accurately estimate Q-values
A B
Which (s,a) pair has a higher Q-value?
Digging deeper into the Q-function
Example: Atari Breakout - Middle
Which (s, a) pair has a higher Q-value: A or B?
Digging deeper into the Q-function
Example: Atari Breakout - Side
Which (s, a) pair has a higher Q-value: A or B?
Deep Q Networks (DQN)
How can we use deep neural networks to model Q-functions?
Option 1: input state s and action a (e.g., "move right") → Deep NN → output Q(s, a)
Deep Q Networks (DQN)
Option 2: input state s → Deep NN → outputs Q(s, a_1), Q(s, a_2), Q(s, a_3), one per action
Deep Q Networks (DQN): Training
Network: state s → Deep NN → Q(s, a_1), Q(s, a_2), Q(s, a_3)
Loss: ℒ = E[ ‖ (r + γ max_{a'} Q(s', a')) − Q(s, a) ‖² ]
    target: r + γ max_{a'} Q(s', a')    predicted: Q(s, a)
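The same Bellman target r + γ max_{a'} Q(s', a') drives tabular Q-learning; a minimal tabular sketch of one update step (DQN replaces the table with a neural network trained on the squared error; the transition and learning rate below are made up for illustration):

```python
def q_update(Q, s, a, r, s_next, actions, gamma=0.99, lr=0.1):
    """One Q-learning step: move Q(s, a) a fraction lr toward the
    Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    predicted = Q.get((s, a), 0.0)
    Q[(s, a)] = predicted + lr * (target - predicted)
    return Q[(s, a)]

Q = {}           # table starts empty; missing entries read as 0
actions = [0, 1]
# Hypothetical transition (s=0, a=1, r=1.0, s'=1): target = 1.0,
# so the entry moves 10% of the way from 0 toward 1.0.
q_update(Q, 0, 1, 1.0, 1, actions)
print(Q[(0, 1)])  # 0.1
```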
DQN Atari Results
(figure: DQN score relative to human level per Atari game; many games surpass human-level, some remain below human-level)
Downsides of Q-learning
Complexity:
• Can model scenarios where the action space is discrete and small
• Cannot handle continuous action spaces
Flexibility:
• Cannot learn stochastic policies, since the policy is computed deterministically from the Q-function
To overcome these, consider a new class of RL training algorithms: policy gradient methods
IMPORTANT: Imagine you want to predict the steering wheel angle of a car!
Policy Gradient (PG): Key Idea
DQN (before): approximating Q and inferring the optimal policy
state s → Deep NN → Q(s, a_1), Q(s, a_2), Q(s, a_3)
Policy Gradient: Directly optimize the policy!
state s → Deep NN → π(a_1|s), π(a_2|s), π(a_3|s)
The outputs form a probability distribution: Σ_{a_i ∈ A} π(a_i|s) = 1
π(a|s) = P(taking action a | observed state s)
Policy Gradient (PG): Training
1. Run a policy for a while
2. Increase probability of actions that lead to high rewards
3. Decrease probability of actions that lead to low/no rewards

function REINFORCE
    initialize θ
    for episode ~ π_θ:
        {s_t, a_t, r_t} ← episode
        for t = 1 to T-1:
            ∇ ← ∇_θ log π_θ(a_t|s_t) R_t
            θ ← θ + α ∇
    return θ
∇_θ log π_θ(a_t|s_t) R_t : (log-likelihood of action) × (reward)
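The REINFORCE loop can be sketched on the simplest possible environment, a two-armed bandit with a softmax policy over per-action preferences θ, where ∇_θ log π_θ(a) has the closed form (1[a=i] − π(a_i)). The environment, episode count, and learning rate here are illustrative, not from the slides:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def reinforce(n_episodes=2000, alpha=0.1, seed=0):
    """REINFORCE on a toy 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                        # initialize theta
    for _ in range(n_episodes):               # run the policy for a while
        probs = softmax(theta)
        a = rng.choices([0, 1], weights=probs)[0]
        r = 1.0 if a == 1 else 0.0            # reward from the environment
        for i in range(2):
            # grad of log pi(a) wrt theta_i is (1[a==i] - probs[i]); scale by r
            grad = ((1.0 if a == i else 0.0) - probs[i]) * r
            theta[i] += alpha * grad          # theta <- theta + alpha * grad
    return softmax(theta)

probs = reinforce()
print(probs)  # probability mass shifts onto the rewarded arm
```

Rewarded actions get their log-likelihood pushed up; unrewarded ones are left alone (or, with a baseline, pushed down), exactly as in steps 2 and 3 above.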
The Game of Go
Board size (n×n) | Positions 3^(n²)    | % legal | Legal positions
1×1              | 3                   | 33.33%  | 1
2×2              | 81                  | 70.37%  | 57
3×3              | 19,683              | 64.40%  | 12,675
4×4              | 43,046,721          | 56.49%  | 24,318,165
5×5              | 847,288,609,443     | 48.90%  | 414,295,148,741
9×9              | 4.434264882×10^38   | 23.44%  | 1.03919148791×10^38
13×13            | 4.300233593×10^80   | 8.66%   | 3.72497923077×10^79
19×19            | 1.740896506×10^172  | 1.20%   | 2.08168199382×10^170

There are more legal 19×19 board positions than atoms in the observable universe.
Aim: Get more board territory than your opponent.
Source: Wikipedia.
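The "Positions" column is just 3^(n²), since every point on an n×n board is empty, black, or white; a quick check of the small boards (the legal-position counts come from the table and are not recomputed here):

```python
def total_positions(n):
    """Every point on an n x n Go board is empty, black, or white: 3**(n*n)."""
    return 3 ** (n * n)

for n in [1, 2, 3, 4]:
    print(n, total_positions(n))
# 1 -> 3, 2 -> 81, 3 -> 19683, 4 -> 43046721, matching the table
```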
AlphaGo Beats Top Human Player at Go (2016)
Silver et al., Nature 2016.
1) Initial training: human data
2) Self-play and reinforcement learning → super-human performance
3) "Intuition" about board state