
  • Human-level control through deep reinforcement learning

    Liia Butler

  • But first... A quote:
    "The question of whether machines can think... is about as relevant as the question of whether submarines can swim."

    Edsger W. Dijkstra

  • Overview
    1. Introduction

    2. Reinforcement Learning

    3. Deep neural networks

    4. Markov Decision Process

    5. Algorithm Breakdown

    6. Evaluation and conclusions

  • Introduction
    Deep Q-network (DQN)
    - The agent
    - Reinforcement learning plus deep neural networks
    - Goal: general artificial intelligence

    - How little do we have to know to be intelligent? Can we solve a wide range of challenging tasks?

    - Pixels and game score as input

  • Reinforcement Learning
    - Theory of how software agents may optimize their control of an environment
    - Inspired by psychological and neuroscientific perspectives on animal behavior
    - One of the three types of machine learning

    http://en.proft.me/media/science/ml_types.png

  • Space Invaders

    http://www.youtube.com/watch?v=ZisFfiEdQ_E

  • Deep Neural Networks
    - A deep learning architecture; a type of artificial neural network
    - Artificial neural network: a network of highly connected nodes (processing elements) working together on a specific problem, as in a biological nervous system
    - Multiple layers of nodes with increasing abstraction of the data
    - Extract high-level representations from raw data
    - DQN uses a deep convolutional network
    - 84 x 84 x 4 input produced by the preprocessing map
    - Three convolutional layers
    - Two fully connected layers (a sketch of this architecture follows the image links below)


    http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg

    http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
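    For concreteness, here is a minimal sketch of the network just described (84 x 84 x 4 input, three convolutional layers, two fully connected layers), assuming PyTorch; the layer sizes follow the Nature paper, while the class and variable names are illustrative.

    # Sketch of the DQN convolutional network (assumes PyTorch is available).
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, n_actions: int):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            )
            self.fc = nn.Sequential(
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # first fully connected layer
                nn.Linear(512, n_actions),               # one output per action: Q(s, a; Θ)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: batch of stacked preprocessed frames, shape (batch, 4, 84, 84)
            h = self.conv(x)
            return self.fc(h.flatten(start_dim=1))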

  • Markov Decision Process
    - State
    - Action
    - Reward

    http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg

  • What these mean for DQN
    - State - what is going on?
      - The goal was to be universal, so the state is represented by the screen pixels
    - Action - what can we do?
      - Ex. moving in a direction, pressing buttons
    - Reward - what's our motivation?
      - Points, lives, etc.
    - Together these form the agent-environment loop (see the sketch after the image link below)

    http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png
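    As a rough illustration of state, action and reward in an Atari game, here is a minimal interaction loop; it assumes the Gymnasium API with the Atari/ALE environments installed, and the environment id below is an assumption, not from the slides.

    # Minimal state/action/reward loop (assumes gymnasium with the Atari extras installed).
    import gymnasium as gym

    env = gym.make("ALE/SpaceInvaders-v5")        # state = raw screen pixels
    state, info = env.reset()

    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()        # placeholder policy: random joystick/button press
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward                    # reward = change in game score
        done = terminated or truncated

    print("episode score:", total_reward)
    env.close()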

  • How is DQN going to do this?
    - Preprocessing - reduce input dimensionality; take the max pixel value over consecutive frames to remove flickering (a sketch follows below)
    - ε-greedy policy - choosing the action
    - Bellman equation - optimal control of the environment via the action-value function
    - A function approximator to estimate the action-value function
    - Loss function and Q-learning gradient
    - Experience replay - building a data set from the agent's experience
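    A minimal sketch of the preprocessing step just listed, assuming raw RGB Atari frames as NumPy arrays and OpenCV for resizing; the library choice and function names are assumptions, not from the slides.

    # Φ: map raw frames to the 84 x 84 x 4 network input.
    import numpy as np
    import cv2

    def preprocess(frame: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
        """Map one raw 210x160x3 frame to one 84x84 grayscale plane."""
        flicker_free = np.maximum(frame, prev_frame)             # max over consecutive frames removes flickering
        gray = cv2.cvtColor(flicker_free, cv2.COLOR_RGB2GRAY)    # drop color, keep luminance
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

    def stack_last_four(processed: list) -> np.ndarray:
        """Stack the 4 most recent preprocessed frames into the 84 x 84 x 4 input."""
        return np.stack(processed[-4:], axis=-1)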

  • Algorithm Breakdown Key
    D = replay memory (the data set of stored experiences)
    N = number of experience tuples the replay memory holds
    Q = the "quality" (action-value) function
    Θ = the weights of the Q-network (function approximator)
    M = number of episodes
    s = state (a sequence of observations)
    x = observation/image
    Φ = preprocessing function applied to the sequence
    T = time-step at which the game terminates
    ε = probability of a random action in the ε-greedy policy
    a = action
    y = target
    r = reward
    γ = reward discount factor
    C = number of updates between resets of the target network


  • ε-greedy policy
    How to choose the action 'a' at time 't':
    - Exploration: with probability ε, pick a random action
    - Exploitation: otherwise, pick the best action according to the Q-values
    A sketch of this choice follows below.
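    A minimal sketch of ε-greedy action selection, reusing the QNetwork sketched earlier; the helper name and torch dependency are illustrative assumptions.

    # ε-greedy action selection at time t.
    import random
    import torch

    def epsilon_greedy(q_network, state: torch.Tensor, epsilon: float, n_actions: int) -> int:
        if random.random() < epsilon:
            return random.randrange(n_actions)            # exploration: uniform random action
        with torch.no_grad():
            q_values = q_network(state.unsqueeze(0))      # add a batch dimension
        return int(q_values.argmax(dim=1).item())         # exploitation: argmax_a Q(s, a)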


  • Experience Replay
    1. Take an action
    2. Store the transition (s, a, r, s') in memory D
    3. Sample a random minibatch of transitions from D
    4. Optimize using gradient descent on the target 'y' and the Q-network
    A sketch of steps 2-4 follows below.
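    A minimal sketch of steps 2-4, reusing the QNetwork from earlier; the buffer size, batch size and helper names are illustrative assumptions, but the target y and the gradient step follow the scheme on this slide.

    # Experience replay: store transitions in D, sample a minibatch, take one gradient step.
    import random
    from collections import deque

    import torch
    import torch.nn.functional as F

    replay_memory = deque(maxlen=100_000)   # D: holds at most N transitions
    GAMMA = 0.99                            # reward discount factor
    BATCH_SIZE = 32

    def store_transition(s, a, r, s_next, done):
        """Step 2: store the transition (s, a, r, s') in replay memory D."""
        replay_memory.append((s, a, r, s_next, done))

    def replay_update(q_network, target_network, optimizer):
        """Steps 3-4: sample a random minibatch from D and optimize the Q-network."""
        if len(replay_memory) < BATCH_SIZE:
            return
        batch = random.sample(replay_memory, BATCH_SIZE)
        states      = torch.stack([t[0] for t in batch])
        actions     = torch.tensor([t[1] for t in batch], dtype=torch.long)
        rewards     = torch.tensor([t[2] for t in batch], dtype=torch.float32)
        next_states = torch.stack([t[3] for t in batch])
        dones       = torch.tensor([t[4] for t in batch], dtype=torch.float32)

        # Target y = r for terminal transitions, else r + γ max_a' Q(s', a'; Θ⁻).
        with torch.no_grad():
            max_next_q = target_network(next_states).max(dim=1).values
            y = rewards + GAMMA * max_next_q * (1.0 - dones)

        q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; Θ)
        loss = F.smooth_l1_loss(q_sa, y)    # error between the target y and Q(s, a)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()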

  • Optimizing the Q-Network
    - The Bellman equation
    - The loss function we have
    - From this, we get the Q-learning gradient
    The equations referenced on this slide are written out below.
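    Written out with the symbols from the key (θ are the online weights and θ⁻ the weights of the older target network), these are the standard forms of the three equations from the Nature paper:

    % Bellman optimality equation for the action-value function
    Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]

    % Loss at iteration i, with transitions (s, a, r, s') drawn uniformly from replay memory D
    L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \right)^{2} \right]

    % Differentiating the loss with respect to the weights gives the Q-learning gradient
    \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]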


  • Breakout!

    http://www.youtube.com/watch?v=iqXKQf2BOSE

  • Evaluation and Conclusions
    - Agents vs. pro gamers
    - Agent acts at 10 Hz (an action every 0.1 seconds, i.e. every 6th frame)
    - Acting at 60 Hz (every 0.017 seconds, i.e. every frame) gave more than 5% better performance in only 6 games
    - Controlled human conditions
    - Out of the 49 games:
      - 29 at human level or above
      - 20 below

    http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg

  • http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg

    29 out of 49

    20 out of 49

  • Questions and Discussion
    - What do you think are some non-gaming applications of deep reinforcement learning?
    - Do you think that comparing with the "professional human game tester" is a sufficient evaluation? Is there a better way?
    - Should we even have a general AI, or are we better off with domain-specific AIs?
    - Are there other consequences besides a computer beating your high score? (Have we doomed society?)