deep reinforcement learning

Deep RL Abecon 20.05.2016

Uploaded by: leszek-rybicki


TRANSCRIPT

Page 1: Deep Reinforcement Learning

Deep RL, Abecon, 20.05.2016

Page 2: Deep Reinforcement Learning

RL-type Problems

• games: chess, Go, Space Invaders

• balancing a unicycle

• investing in the stock market

• running a business

• making fast food

• life…!

Page 3: Deep Reinforcement Learning

Markov Decision Process

• S - set of states

• A - set of actions (or actions for state)

• P(s, s’ | a) - probability of transitioning from state s to s’ under action a

• R(s, s’ | a) - reward received for transitioning from s to s’ under action a

• ∈ [0, 1] - discount factor

Page 4: Deep Reinforcement Learning
Page 5: Deep Reinforcement Learning

The GOAL: maximize the total discounted reward.
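The formula itself was an image on the original slide; written out (assuming the usual convention that r_{t+1} is the reward received after step t), the return to maximize is:

```latex
G = \sum_{t=0}^{\infty} \gamma^{t} \, r_{t+1}, \qquad \gamma \in [0, 1]
```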

Page 6: Deep Reinforcement Learning

γ: discount factor

γ → 0: instant gratification

γ → 1: patience

Page 7: Deep Reinforcement Learning

Value Functions

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
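The gridworld demo linked above computes value functions by dynamic programming. A minimal value-iteration sketch on a toy 2-state, 2-action MDP (all numbers invented; the repeated Bellman optimality backup is the same idea the demo runs on its grid):

```javascript
// Value iteration on a toy MDP (numbers invented for illustration).
var P = [[[0.9, 0.1], [0.0, 1.0]],   // P[s][a][s'] transition probabilities
         [[1.0, 0.0], [0.0, 1.0]]];
var R = [[0, 0], [1, 0]];            // R[s][a] immediate reward
var gamma = 0.9;

var V = [0, 0];                      // initial value estimates
for (var iter = 0; iter < 200; iter++) {
  // One synchronous sweep: V(s) <- max_a [ R(s,a) + gamma * E[V(s')] ]
  V = V.map(function(_, s) {
    var qs = P[s].map(function(probs, a) {
      var expected = probs.reduce(function(acc, p, s2) {
        return acc + p * V[s2];      // expectation over next states
      }, 0);
      return R[s][a] + gamma * expected;
    });
    return Math.max.apply(null, qs); // greedy over actions
  });
}
console.log(V);                      // converged optimal state values
```

Because the backup is a γ-contraction, 200 sweeps are far more than enough for this toy problem to converge.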

Page 10: Deep Reinforcement Learning

Reinforce.js

http://cs.stanford.edu/people/karpathy/reinforcejs/

// create an environment object
var env = {};
env.getNumStates = function() { return 8; }
env.getMaxNumActions = function() { return 4; }

// create the DQN agent
var spec = { alpha: 0.01 }
agent = new RL.DQNAgent(env, spec);

setInterval(function(){ // start the learning loop
  var action = agent.act(s); // s is an array of length 8
  agent.learn(reward);
}, 0);

Page 11: Deep Reinforcement Learning

2013: Deep RL

http://arxiv.org/abs/1312.5602

Page 12: Deep Reinforcement Learning

2014: Google buys DeepMind

Page 13: Deep Reinforcement Learning

2015: AlphaGO

Page 14: Deep Reinforcement Learning

Deep Q-Learning

1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.

2. Do a feedforward pass for the next state s’ and calculate the maximum over all network outputs, max_a’ Q(s’, a’).

3. Set the Q-value target for the taken action to r + γ max_a’ Q(s’, a’) (the max calculated in step 2). For all other actions, set the Q-value target to the value originally returned in step 1, making the error 0 for those outputs.

4. Update the weights using backpropagation.

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
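The target construction in the four steps above can be sketched in code. Here qForward is a hypothetical stand-in for the network's feedforward pass (a simple lookup with invented Q-values); the actual network and the backpropagation of step 4 are out of scope:

```javascript
// Sketch of the Deep Q-Learning target construction (steps 1-3 above).
// qForward stands in for the network's feedforward pass: here just a
// lookup over invented Q-values, purely for illustration.
var Q = {s: [0.2, 0.5], sNext: [1.0, 0.3]};
function qForward(state) { return Q[state].slice(); }

var gamma = 0.9;
var reward = 1.0;
var actionTaken = 0;

// Step 1: predicted Q-values for the current state s
var qPred = qForward("s");
// Step 2: maximum over the network outputs for the next state s'
var maxNext = Math.max.apply(null, qForward("sNext"));
// Step 3: the target equals the prediction for every action except the
// one taken, so the error is 0 for those outputs
var target = qPred.slice();
target[actionTaken] = reward + gamma * maxNext;
// Step 4 (not shown): backpropagate the error (target - qPred)
console.log(target); // [1.9, 0.5]
```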

Page 15: Deep Reinforcement Learning

Deep Q-Learning

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/

Page 16: Deep Reinforcement Learning

https://www.youtube.com/watch?v=32y3_iyHpBc

http://gabrielecirulli.github.io/2048/

Page 17: Deep Reinforcement Learning

Asynchronous Gradient Descent

http://arxiv.org/abs/1602.01783

Page 18: Deep Reinforcement Learning

http://www.rethinkrobotics.com/baxter/