Human-level control through deep reinforcement learning
Liia Butler
-
But first... a quote: "The question of whether machines can think... is about as relevant as the question of whether submarines can swim."
Edsger W. Dijkstra
-
Overview
1. Introduction
2. Reinforcement Learning
3. Deep neural networks
4. Markov Decision Process
5. Algorithm Breakdown
6. Evaluation and conclusions
-
Introduction
Deep Q-network (DQN)
- The agent: reinforcement learning plus deep neural networks
- Goal: general artificial intelligence
- How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks?
- Pixels and game score as input
-
Reinforcement Learning
- A theory of how software agents may optimize their control of an environment
- Inspired by psychological and neuroscientific perspectives on animal behavior
- One of the three types of machine learning (alongside supervised and unsupervised learning)
http://en.proft.me/media/science/ml_types.png
-
Space Invaders
http://www.youtube.com/watch?v=ZisFfiEdQ_E
-
Deep Neural Networks
- An architecture in deep learning; a type of artificial neural network
- Artificial neural network: a network of highly connected processing nodes working together on specific problems, as in a biological nervous system
- Multiple layers of nodes with increasing abstraction of the data
- Extract high-level representations from raw data
- DQN uses a "deep convolutional network" (see the sketch below):
- 84 x 84 x 4 image produced by the preprocessing map
- Three convolutional layers
- Two fully connected layers
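A sketch of that architecture in PyTorch (the layer sizes follow the Nature paper; the framework choice and class name are mine, not the talk's):

import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Nature-DQN-style network: 84x84x4 input -> 3 conv layers -> 2 fully connected layers."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 9x9x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 7x7x64
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),   # first fully connected layer
            nn.ReLU(),
            nn.Linear(512, num_actions),  # output: one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))

Note the output layer: the network maps a state to Q-values for every action at once, so a single forward pass scores all possible moves.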
-
http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg
http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
-
Markov Decision Process
- State
- Action
- Reward
http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg
-
What these mean for DQN
- State - What is going on? The goal was to be universal, so the state is represented by raw screen pixels
- Action - What can we do? E.g., movement directions and button presses
- Reward - What's our motivation? Points, lives, etc.
A minimal loop illustrating these three pieces follows the image link below.
http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
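A sketch of the state/action/reward loop, using the Gymnasium Atari interface as an assumed stand-in for the Arcade Learning Environment the paper used (the environment id and API here are Gymnasium's, not the talk's):

import gymnasium as gym

# Assumes Gymnasium with the Atari extras installed.
env = gym.make("ALE/Breakout-v5")

state, info = env.reset()  # state: the raw screen pixels
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # action: a joystick/button choice (random here)
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # reward: the change in game score
    done = terminated or truncated
print("episode score:", total_reward)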
-
How is DQN going to do this?
- Preprocessing - reduce input dimensionality; take the max pixel value over consecutive frames to remove flickering (see the sketch below)
- ε-greedy policy - choosing the action
- Bellman equation - optimal control of the environment via the action-value function
- A function approximator to estimate the action-value function
- Loss function and Q-learning gradient
- Experience replay - building a data set from the agent's experience
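A rough sketch of the preprocessing step as the paper describes it (max over two consecutive frames to remove flicker, grayscale, rescale to 84x84, stack the last four frames); the helper names and the OpenCV dependency are my own assumptions:

import numpy as np
import cv2  # assumed dependency for grayscale conversion and resizing

def preprocess(frame, prev_frame):
    """Map one raw RGB Atari frame to a single 84x84 grayscale image."""
    # Max over two consecutive frames removes sprite flickering.
    flicker_free = np.maximum(frame, prev_frame)
    gray = cv2.cvtColor(flicker_free, cv2.COLOR_RGB2GRAY)  # reduce dimensionality
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def stack_last_four(processed_frames):
    """The network input: the four most recent preprocessed frames, 84x84x4."""
    return np.stack(processed_frames[-4:], axis=-1)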
-
Algorithm Breakdown Key
D = replay memory (the data set)
N = number of experience tuples the replay memory holds
Q = the "quality" (action-value) function
Θ = the weights of the function approximator
M = number of episodes
s = sequence/state
x = observation/image
Φ = preprocessing map
T = time-step at which the game terminates
ε = probability of a random action in the ε-greedy policy
a = action
y = target
r = reward
γ = reward discount factor
C = number of steps between updates of the target network
-
ε-greedy policy
How to choose the action 'a' at time 't' (see the sketch below):
- Exploration: with probability ε, take a random action
- Exploitation: otherwise, take the best action according to the Q value
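A minimal sketch of ε-greedy selection, assuming a q_values sequence indexed by action (the names are illustrative):

import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # exploration: uniform random action
    # exploitation: argmax over actions of Q(s, a)
    return max(range(len(q_values)), key=lambda a: q_values[a])

In the paper, ε is annealed from 1.0 down to 0.1 over training, so the agent explores heavily at first and mostly exploits later.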
-
Experience Replay (see the sketch below)
1. Take action
2. Store the transition in memory D
3. Sample a random minibatch of transitions from D
4. Optimize using gradient descent on target 'y' and the Q-network
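A minimal sketch of the replay memory (capacity N, uniform sampling); the class and method names are my own, not from the paper:

import random
from collections import deque

class ReplayMemory:
    """Stores the last N transitions (s, a, r, s_next, done) and samples uniformly."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # random minibatch from D

Sampling uniformly at random breaks the strong correlation between consecutive frames, which is a large part of why DQN trains stably.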
-
Optimizing the Q-Network
- Bellman equation, for the optimal action-value function:
  $Q^*(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s',a') \mid s, a \,\right]$
- The loss function we have, with target $y = r + \gamma \max_{a'} Q(s',a';\theta^-)$:
  $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(y - Q(s,a;\theta_i)\right)^2\right]$
- Differentiating with respect to the weights gives us the Q-learning gradient (see the sketch below):
  $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(y - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right]$
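A sketch of the resulting update in PyTorch, assuming q_net and target_net are networks like the architecture sketch earlier and batch holds tensors drawn from the replay memory (all names are illustrative):

import torch
import torch.nn.functional as F

def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient-descent step on the squared TD error from a replay minibatch."""
    states, actions, rewards, next_states, dones = batch  # dones: 1.0 if terminal, else 0.0

    # Target y = r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta^-)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_q * (1.0 - dones)

    # Current estimate Q(s, a; theta) for the actions actually taken
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q, y)  # squared TD error
    optimizer.zero_grad()
    loss.backward()          # yields the Q-learning gradient
    optimizer.step()
    return loss.item()

θ⁻ in the loss refers to the target network's weights, which are held fixed and only copied from the Q-network every C steps (target_net.load_state_dict(q_net.state_dict())).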
-
Breakout!
http://www.youtube.com/watch?v=iqXKQf2BOSE
-
Evaluation and Conclusions
- Agents vs. professional human game testers
- Agent acts at 10 Hz (an action every 0.1 seconds, i.e., every 6th frame)
- At 60 Hz (an action every 0.017 seconds, i.e., every frame), only 6 games improved by more than 5%
- Controlled human conditions
- Out of the 49 games: 29 at human level or above, 20 below
http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg
-
http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg
29 out of 49 games at or above human level
20 out of 49 below human level
-
Questions and Discussion
- What do you think are some non-gaming applications of deep reinforcement learning?
- Is comparing against the "professional human game tester" a sufficient evaluation? Is there a better way?
- Should we even have a general AI, or are we better off with domain-specific AIs?
- Are there other consequences besides a computer beating your high score? (Have we doomed society?)