
  • Human-level control through deep reinforcement learning

    Liia Butler

  • But first... A quote:
    "The question of whether machines can think... is about as relevant as the question of whether submarines can swim."

    Edsger W. Dijkstra

  • Overview
    1. Introduction

    2. Reinforcement Learning

    3. Deep neural networks

    4. Markov Decision Process

    5. Algorithm Breakdown

    6. Evaluation and conclusions

  • Introduction
    Deep Q-network (DQN)
    - The agent
    - Reinforcement learning plus deep neural networks
    - Goal: general artificial intelligence

    - How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks?

    - Pixels and game score as input

  • Reinforcement Learning
    - Theory of how software agents may optimize their control of an environment
    - Inspired by psychological and neuroscientific perspectives on animal behavior
    - One of the three types of machine learning

    http://en.proft.me/media/science/ml_types.png

  • Space Invaders

    http://www.youtube.com/watch?v=ZisFfiEdQ_E

  • Deep Neural Networks
    - A deep learning architecture; a type of artificial neural network
    - Artificial neural network: a network of highly connected nodes (processing elements) working together on a specific problem, as in a biological nervous system
    - Multiple layers of nodes with increasing abstraction of the data
    - Extract high-level representations from raw data
    - DQN uses a deep convolutional network
    - 84 x 84 x 4 input produced by the preprocessing map
    - Three convolutional layers
    - Two fully connected layers (a sketch of this architecture follows the image links below)


    http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg

    http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
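    For concreteness, here is a minimal sketch of the network just described (84 x 84 x 4 input, three convolutional layers, two fully connected layers), assuming PyTorch; the layer sizes follow the Nature paper, while the class and variable names are illustrative.

    # Sketch of the DQN convolutional network (assumes PyTorch is available).
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, n_actions: int):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            )
            self.fc = nn.Sequential(
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # first fully connected layer
                nn.Linear(512, n_actions),               # one output per action: Q(s, a; Θ)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: batch of stacked preprocessed frames, shape (batch, 4, 84, 84)
            h = self.conv(x)
            return self.fc(h.flatten(start_dim=1))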

  • Markov Decision Process
    - State
    - Action
    - Reward

    http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg

  • What these mean for DQN
    - State - what is going on?
      - The goal was to be universal, so the state is represented by the screen pixels
    - Action - what can we do?
      - Ex. moving in a direction, pressing buttons
    - Reward - what's our motivation?
      - Points, lives, etc.
    - Together these form the agent-environment loop (see the sketch after the image link below)

    http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
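    As a rough illustration of state, action and reward in an Atari game, here is a minimal interaction loop; it assumes the Gymnasium API with the Atari/ALE environments installed, and the environment id below is an assumption, not from the slides.

    # Minimal state/action/reward loop (assumes gymnasium with the Atari extras installed).
    import gymnasium as gym

    env = gym.make("ALE/SpaceInvaders-v5")        # state = raw screen pixels
    state, info = env.reset()

    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()        # placeholder policy: random joystick/button press
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward                    # reward = change in game score
        done = terminated or truncated

    print("episode score:", total_reward)
    env.close()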

  • How is DQN going to do this?
    - Preprocessing - reduce input dimensionality; take the max pixel value over consecutive frames to remove flickering (a sketch follows below)
    - ε-greedy policy - choosing the action
    - Bellman equation - optimal control of the environment via the action-value function
    - A function approximator to estimate the action-value function
    - Loss function and Q-learning gradient
    - Experience replay - building a data set from the agent's experience
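    A minimal sketch of the preprocessing step just listed, assuming raw RGB Atari frames as NumPy arrays and OpenCV for resizing; the library choice and function names are assumptions, not from the slides.

    # Φ: map raw frames to the 84 x 84 x 4 network input.
    import numpy as np
    import cv2

    def preprocess(frame: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
        """Map one raw 210x160x3 frame to one 84x84 grayscale plane."""
        flicker_free = np.maximum(frame, prev_frame)             # max over consecutive frames removes flickering
        gray = cv2.cvtColor(flicker_free, cv2.COLOR_RGB2GRAY)    # drop color, keep luminance
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

    def stack_last_four(processed: list) -> np.ndarray:
        """Stack the 4 most recent preprocessed frames into the 84 x 84 x 4 input."""
        return np.stack(processed[-4:], axis=-1)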

  • Algorithm Breakdown Key
    D = replay memory (the data set of stored experiences)
    N = number of experience tuples the replay memory holds
    Q = the "quality" (action-value) function
    Θ = the weights of the Q-network (function approximator)
    M = number of episodes
    s = state (a sequence of observations)
    x = observation/image
    Φ = preprocessing function applied to the sequence
    T = time-step at which the game terminates
    ε = probability of a random action in the ε-greedy policy
    a = action
    y = target
    r = reward
    γ = reward discount factor
    C = number of updates between resets of the target network


  • ε-greedy policy
    How to choose the action 'a' at time 't':
    - Exploration: with probability ε, pick a random action
    - Exploitation: otherwise, pick the best action according to the Q-values
    A sketch of this choice follows below.
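    A minimal sketch of ε-greedy action selection, reusing the QNetwork sketched earlier; the helper name and torch dependency are illustrative assumptions.

    # ε-greedy action selection at time t.
    import random
    import torch

    def epsilon_greedy(q_network, state: torch.Tensor, epsilon: float, n_actions: int) -> int:
        if random.random() < epsilon:
            return random.randrange(n_actions)            # exploration: uniform random action
        with torch.no_grad():
            q_values = q_network(state.unsqueeze(0))      # add a batch dimension
        return int(q_values.argmax(dim=1).item())         # exploitation: argmax_a Q(s, a)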


  • Experience Replay
    1. Take an action
    2. Store the transition (s, a, r, s') in memory D
    3. Sample a random minibatch of transitions from D
    4. Optimize using gradient descent on the target 'y' and the Q-network
    A sketch of steps 2-4 follows below.
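    A minimal sketch of steps 2-4, reusing the QNetwork from earlier; the buffer size, batch size and helper names are illustrative assumptions, but the target y and the gradient step follow the scheme on this slide.

    # Experience replay: store transitions in D, sample a minibatch, take one gradient step.
    import random
    from collections import deque

    import torch
    import torch.nn.functional as F

    replay_memory = deque(maxlen=100_000)   # D: holds at most N transitions
    GAMMA = 0.99                            # reward discount factor
    BATCH_SIZE = 32

    def store_transition(s, a, r, s_next, done):
        """Step 2: store the transition (s, a, r, s') in replay memory D."""
        replay_memory.append((s, a, r, s_next, done))

    def replay_update(q_network, target_network, optimizer):
        """Steps 3-4: sample a random minibatch from D and optimize the Q-network."""
        if len(replay_memory) < BATCH_SIZE:
            return
        batch = random.sample(replay_memory, BATCH_SIZE)
        states      = torch.stack([t[0] for t in batch])
        actions     = torch.tensor([t[1] for t in batch], dtype=torch.long)
        rewards     = torch.tensor([t[2] for t in batch], dtype=torch.float32)
        next_states = torch.stack([t[3] for t in batch])
        dones       = torch.tensor([t[4] for t in batch], dtype=torch.float32)

        # Target y = r for terminal transitions, else r + γ max_a' Q(s', a'; Θ⁻).
        with torch.no_grad():
            max_next_q = target_network(next_states).max(dim=1).values
            y = rewards + GAMMA * max_next_q * (1.0 - dones)

        q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; Θ)
        loss = F.smooth_l1_loss(q_sa, y)    # error between the target y and Q(s, a)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()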

  • Optimizing the Q-Network
    - The Bellman equation
    - The loss function we have
    - From this, we get the Q-learning gradient
    The equations referenced on this slide are written out below.
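    Written out with the symbols from the key (θ are the online weights and θ⁻ the weights of the older target network), these are the standard forms of the three equations from the Nature paper:

    % Bellman optimality equation for the action-value function
    Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]

    % Loss at iteration i, with transitions (s, a, r, s') drawn uniformly from replay memory D
    L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \right)^{2} \right]

    % Differentiating the loss with respect to the weights gives the Q-learning gradient
    \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]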


  • Breakout!

    http://www.youtube.com/watch?v=iqXKQf2BOSE

  • Evaluation and Conclusions
    - Agents vs. pro gamers
    - Agent acts at 10 Hz (an action every 0.1 seconds, i.e. every 6th frame)
    - Acting at 60 Hz (every 0.017 seconds, i.e. every frame) gave more than 5% better performance in only 6 games
    - Controlled human conditions
    - Out of the 49 games:
      - 29 at human level or above
      - 20 below

    http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg

  • http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg

    29 out of 49

    20 out of 49

  • Questions and Discussion
    - What do you think are some non-gaming applications of deep reinforcement learning?
    - Do you think that comparing with the "professional human game tester" is a sufficient evaluation? Is there a better way?
    - Should we even have a general AI, or are we better off with domain-specific AIs?
    - Are there other consequences besides a computer beating your high score? (Have we doomed society?)